<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>We don't do anything. In our environment it is the user's
responsibility to optimize their code appropriately. Since we
have a great variety of hardware, any modules we build (we have
several thousand of them) are all built generically. If people
want processor-specific optimizations, they have to build
their own stack.</p>
<p>-Paul Edmon-<br>
</p>
<div class="moz-cite-prefix">On 6/20/19 11:07 AM, Fulcomer, Samuel
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAOORAuEt589vMW=2R-mx-=xC9mhNEgiFrhrFT6FnNpZrz7JhfA@mail.gmail.com">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<div dir="ltr">
<div dir="ltr">...ah, got it. I was confused by "PI/Lab nodes"
in your partition list.</div>
<div dir="ltr"><br>
</div>
<div>Our QoS/account pair for each investigator condo is our
approximate equivalent of what you're doing with owned
partitions. </div>
<div><br>
</div>
<div>Since we have everything in one partition, we segregate
processor types via topology.conf. We break up topology.conf
further to keep MPI jobs on the same switch.</div>
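<div><br>
</div>
<div>For illustration, a minimal sketch of that idea (the switch and
node names below are made up, not our real ones):</div>
<pre>
# slurm.conf
TopologyPlugin=topology/tree

# topology.conf -- one leaf switch per node/processor group so MPI
# jobs stay on a single switch; a core switch ties the leaves together
SwitchName=ivy-sw1   Nodes=ivy[001-032]
SwitchName=bdw-sw1   Nodes=bdw[001-032]
SwitchName=csc-sw1   Nodes=csc[001-032]
SwitchName=core      Switches=ivy-sw1,bdw-sw1,csc-sw1
</pre>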
<div><br>
</div>
<div>On another topic, how do you address code optimization for
processor type? We've been mostly linking with MKL and relying
on its multi-code-path dispatch. </div>
<div><br>
</div>
<div>Regards,</div>
<div>Sam</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Thu, Jun 20, 2019 at
10:20 AM Paul Edmon <<a
href="mailto:pedmon@cfa.harvard.edu"
moz-do-not-send="true">pedmon@cfa.harvard.edu</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<p>People will specify which partition they need, or if
they want multiple, they use this:<br>
</p>
<p>#SBATCH -p general,shared,serial_requeue</p>
<p>The scheduler will then start the job in whichever of
those partitions can run it first. Naturally there is a risk
that you will end up running in a more expensive partition.</p>
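<p>For example (illustrative only; the resource requests here are
made up, the partition line is the real point), a job script might
look like:</p>
<pre>
#!/bin/bash
#SBATCH -p general,shared,serial_requeue  # scheduler starts the job wherever it can run first
#SBATCH -n 16                             # illustrative core count
#SBATCH -t 1-00:00:00                     # illustrative time limit
#SBATCH --mem-per-cpu=4G                  # illustrative memory request

srun ./my_app                             # placeholder application
</pre>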
<p>Our time limit is applied only to our public
partitions; our owned partitions (of which we have
roughly 80) have no time limit, so people running on
their dedicated resources pay no penalty. We've been
working on getting rid of owned partitions and moving to
school/department based partitions, where all the
purchased resources for the different PIs go into the
same bucket and they compete against each other rather
than against the wider community. We've found that this
works pretty well, as most PIs use their purchased
resources only sporadically. Thus there are usually idle
cores lying around that we backfill with our serial
queues. Since those jobs are requeueable, the owners
still get an immediate response when they need that
space back. We are also toying with a high priority
partition that is open to people with high fairshare so
that they can get an immediate response, as those with
high fairshare tend to be bursty users.</p>
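<p>Roughly speaking, that kind of requeue-based backfill can be
wired up in slurm.conf along these lines (the partition and node
names here are purely illustrative, not our actual config):</p>
<pre>
# slurm.conf
PreemptType=preempt/partition_prio

# Owned/department partition: higher tier, no time limit
PartitionName=dept_astro Nodes=holy[0001-0080] PriorityTier=2 MaxTime=UNLIMITED

# Backfill partition overlapping the same nodes: lower tier, its jobs
# are requeued as soon as an owner job needs the cores
PartitionName=serial_requeue Nodes=holy[0001-0080] PriorityTier=1 PreemptMode=REQUEUE MaxTime=7-00:00:00
</pre>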
<p>Our current half-life is set to a month and we keep 6
months of data in our database. I'd actually like to
get rid of the half-life and just go to a 3-month moving
window to allow people to bank their fairshare, but we
haven't done that yet, as people have been having a hard
enough time understanding our current system. That's not
due to its complexity; it's more that most people just
flat out aren't cognizant of their usage and think the
resource is functionally infinite.</p>
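<p>For reference, the relevant knobs in slurm.conf look roughly like
this (the weight is illustrative; note that PriorityUsageResetPeriod
is a periodic hard reset, not a true moving window):</p>
<pre>
# slurm.conf -- multifactor priority with a one-month usage half-life
PriorityType=priority/multifactor
PriorityDecayHalfLife=30-0          # 30 days
PriorityWeightFairshare=10000000    # illustrative weight

# The closest built-in alternative to a moving window is a hard reset:
# PriorityUsageResetPeriod=QUARTERLY
</pre>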
<p>-Paul Edmon-<br>
</p>
<div class="gmail-m_-2295429921239604436moz-cite-prefix">On
6/19/19 5:16 PM, Fulcomer, Samuel wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div dir="ltr">Hi Paul,
<div><br>
</div>
<div>Thanks. Your setup is interesting. I see that
you have your processor types segregated in their
own partitions (with the exception of the
requeue partition), and that's how you get at the
weighting mechanism. Do you have your users
explicitly specify multiple partitions in their
batch commands/scripts in order to take advantage
of this, or do you use a plugin for it?</div>
<div><br>
</div>
<div>It sounds like you don't impose any hard limit
on simultaneous resource use, and allow everything
to fairshare out with the help of the 7 day
TimeLimit. We haven't been imposing any TimeLimit
on our condo users, which would be an issue for us
with your config. For our exploratory and priority
users, we impose an effective time limit with
GrpTRESRunMins=cpu (and gres/gpu= for the GPU
usage). In addition, since we have so many
priority users, we don't explicitly set a rawshare
value for them (they all execute under the
"default" account). We set rawshare for the condo
accounts as cores-purchased/total-cores*1000. </div>
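<div><br>
</div>
<div>To make that concrete, a rough sacctmgr sketch (account names,
core counts, and limit values are all made up):</div>
<pre>
# Priority user: effective time limit via GrpTRESRunMins, in cpu-minutes
# (plus a GPU-minutes cap); numbers are illustrative
sacctmgr add qos priority_jdoe set GrpTRES=cpu=192 \
    GrpTRESRunMins=cpu=1000000,gres/gpu=20000

# Condo account rawshare = cores-purchased / total-cores * 1000,
# e.g. 512 purchased cores out of 16000 total: 512/16000*1000 = 32
sacctmgr modify account condo_smith set fairshare=32
</pre>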
<div><br>
</div>
<div>What's your fairshare decay setting (don't
remember the proper name at the moment)?</div>
<div><br>
</div>
<div>Regards,</div>
<div>Sam</div>
<div><br>
</div>
<div><br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Wed, Jun 19,
2019 at 3:44 PM Paul Edmon <<a
href="mailto:pedmon@cfa.harvard.edu"
target="_blank" moz-do-not-send="true">pedmon@cfa.harvard.edu</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px
0px 0px 0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<p>We do a similar thing here at Harvard:</p>
<p><a
class="gmail-m_-2295429921239604436gmail-m_8457408054565706666moz-txt-link-freetext"
href="https://www.rc.fas.harvard.edu/fairshare/" target="_blank"
moz-do-not-send="true">https://www.rc.fas.harvard.edu/fairshare/</a></p>
<p>We simply weight all the partitions based on
their core type and then we allocate Shares
for each account based on what they have
purchased. We don't use QoS at all, so we
just rely purely on fairshare weighting for
resource usage. It has worked pretty well for
our purposes.</p>
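<p>A rough sketch of how that kind of per-core-type weighting and
per-account Shares can be expressed (partition names, weights, and
share values are illustrative; TRESBillingWeights is one common way
to do the weighting):</p>
<pre>
# slurm.conf -- charge newer cores more heavily
PartitionName=ivybridge Nodes=ivy[001-100] TRESBillingWeights="CPU=1.0"
PartitionName=broadwell Nodes=bdw[001-100] TRESBillingWeights="CPU=1.5"
PartitionName=cascade   Nodes=csc[001-100] TRESBillingWeights="CPU=2.0"

# Shares per account, sized to what each group purchased (illustrative)
sacctmgr modify account lab_alpha set fairshare=40
sacctmgr modify account lab_beta  set fairshare=15
</pre>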
<p>-Paul Edmon-<br>
</p>
<div
class="gmail-m_-2295429921239604436gmail-m_8457408054565706666moz-cite-prefix">On
6/19/19 3:30 PM, Fulcomer, Samuel wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr"><br>
<div>(...and yes, the name is inspired by a
certain OEM's software licensing
schemes...)</div>
<div><br>
</div>
<div>At Brown we run a ~400 node cluster
containing nodes of multiple architectures
(Sandy/Ivy, Haswell/Broadwell, and
Sky/Cascade) purchased in some cases by
University funds and in others by
investigator funding (~50:50). They all
appear in the default SLURM partition. We
have 3 classes of SLURM users:</div>
<div><br>
</div>
<div>
<ol>
<li>Exploratory - no-charge access to up
to 16 cores</li>
<li>Priority - $750/quarter for access
to up to 192 cores (and with a
GrpTRESRunMins=cpu limit). Each user
has their own QoS</li>
<li>Condo - an investigator group who
paid for nodes added to the cluster.
The group has its own QoS and SLURM
Account. The QoS allows use of the
number of cores purchased and has a
much higher priority than the QoSes of
the "priority" users (a rough sketch
follows this list).</li>
</ol>
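<div>For concreteness, a rough sacctmgr sketch of how these three
classes can be encoded (all names, priorities, and limits below are
illustrative):</div>
<pre>
# 1. Exploratory: no charge, capped at 16 cores
sacctmgr add qos exploratory set GrpTRES=cpu=16 Priority=10

# 2. Priority: per-user QoS with a 192-core cap and a cpu-minutes run limit
sacctmgr add qos priority_jdoe set GrpTRES=cpu=192 \
    GrpTRESRunMins=cpu=500000 Priority=100

# 3. Condo: per-group account and QoS sized to the cores purchased,
#    with much higher priority than the "priority" QoSes
sacctmgr add account condo_smith
sacctmgr add qos condo_smith set GrpTRES=cpu=256 Priority=1000
</pre>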
<div>The first problem with this scheme is
that condo users who have purchased the
older hardware now have access to the
newest without penalty. In addition,
we're encountering resistance to the
idea of turning off their hardware and
terminating their condos (despite MOUs
stating a 5yr life). The pushback is the
stated belief that the hardware should
run until it dies.</div>
</div>
<div><br>
</div>
<div>What I propose is a new TRES called a
Processor Performance Unit (PPU) that
would be specified on the Node line in
slurm.conf, and used such that
GrpTRES=ppu=N was calculated as the number
of allocated cores multiplied by their
associated PPU numbers.</div>
<div><br>
</div>
<div>We could then assign a base PPU to the
oldest hardware, say "1" for Sandy/Ivy,
and increase it for later architectures based
on the performance improvement. We'd set the
condo QoS to GrpTRES=ppu=N*X+M*Y,...,
where N is the number of cores of the
oldest architecture, X is its configured
PPU per core, and M*Y (and so on) repeats
the pattern for any newer nodes/cores the
investigator has purchased since.</div>
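<div><br>
</div>
<div>A purely hypothetical sketch of what that could look like (none
of this syntax exists in Slurm today, and all numbers are made up):</div>
<pre>
# Hypothetical slurm.conf -- PPU declared per node type
NodeName=sandy[001-008]   CPUs=16 RealMemory=64000  PPU=1.0
NodeName=cascade[001-004] CPUs=32 RealMemory=192000 PPU=1.8

# A condo that bought 128 Sandy/Ivy cores and 64 Cascade cores:
#   128*1.0 + 64*1.8 = 243.2  =>  GrpTRES=ppu=243
sacctmgr modify qos condo_smith set GrpTRES=ppu=243
</pre>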
<div><br>
</div>
<div>The result is that the investigator
group gets to run on an approximation of
the performance that they've purchased,
rather than on the raw purchased core count.</div>
<div><br>
</div>
<div>Thoughts?</div>
<div><br>
</div>
<div><br>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</div>
</blockquote>
</body>
</html>