<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <p>We don't do anything.  In our environment it is the user's
      responsibility to optimize their code appropriately.  Since we
      have a great variety of hardware, any modules we build (we have
      several thousand of them) are all built generically.  If people
      want processor-specific optimizations, they have to build
      their own stack.</p>
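    <p>(As a concrete sketch of the distinction, assuming GCC - the
      exact flags are illustrative, not a statement of what we use:)</p>
    <pre>
# Generic build: runs on any x86-64 node in the cluster
gcc -O2 -march=x86-64 -mtune=generic app.c -o app

# Processor-specific build a user might do for, say, Skylake nodes
gcc -O2 -march=skylake-avx512 app.c -o app
    </pre>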
    <p>-Paul Edmon-<br>
    </p>
    <div class="moz-cite-prefix">On 6/20/19 11:07 AM, Fulcomer, Samuel
      wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CAOORAuEt589vMW=2R-mx-=xC9mhNEgiFrhrFT6FnNpZrz7JhfA@mail.gmail.com">
      <meta http-equiv="content-type" content="text/html; charset=UTF-8">
      <div dir="ltr">
        <div dir="ltr">...ah, got it. I was confused by "PI/Lab nodes"
          in your partition list.</div>
        <div dir="ltr"><br>
        </div>
        <div>Our QoS/account pair for each investigator condo is our
          approximate equivalent of what you're doing with owned
          partitions. </div>
        <div><br>
        </div>
        <div>Since we have everything in one partition, we segregate
          processor types via topology.conf. We break topology.conf up
          further to keep MPI jobs on the same switch.</div>
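        <div>(Roughly how that can look in topology.conf - node and
          switch names here are hypothetical:)</div>
        <pre>
# topology.conf: one leaf switch per processor type, so the tree
# plugin prefers placing a job on nodes of a single architecture
SwitchName=ib-sandy    Nodes=sandy[001-032]
SwitchName=ib-haswell  Nodes=haswell[001-032]
SwitchName=ib-sky      Nodes=sky[001-032]
SwitchName=ib-core     Switches=ib-sandy,ib-haswell,ib-sky
        </pre>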
        <div><br>
        </div>
        <div>On another topic, how do you address code optimization for
          processor type? We've been mostly linking with MKL and relying
          on its multi-code-path dispatch. </div>
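        <div>(For reference, the sort of link line we mean - assuming
          the classic compiler and sequential MKL layers; exact flags
          vary by toolchain:)</div>
        <pre>
# One binary; MKL selects the SSE/AVX/AVX-512 code path at runtime
icc -O2 app.c -o app -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl
        </pre>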
        <div><br>
        </div>
        <div>Regards,</div>
        <div>Sam</div>
        <br>
        <div class="gmail_quote">
          <div dir="ltr" class="gmail_attr">On Thu, Jun 20, 2019 at
            10:20 AM Paul Edmon <<a
              href="mailto:pedmon@cfa.harvard.edu"
              moz-do-not-send="true">pedmon@cfa.harvard.edu</a>>
            wrote:<br>
          </div>
          <blockquote class="gmail_quote" style="margin:0px 0px 0px
            0.8ex;border-left:1px solid
            rgb(204,204,204);padding-left:1ex">
            <div bgcolor="#FFFFFF">
              <p>People will specify which partition they need, or, if
                they want multiple, they use this:<br>
              </p>
              <p>#SBATCH -p general,shared,serial_requeue</p>
              <p>The scheduler will then start the job in whichever of
                those partitions can take it first.  Naturally there is
                a risk that you will end up running in a more expensive
                partition.</p>
              <p>Our time limit applies only to our public partitions;
                our owned partitions (of which we have roughly 80) have
                no time limit, so people running on their dedicated
                resources pay no penalty.  We've been working on getting
                rid of owned partitions and moving to school/department
                based partitions, where all the purchased resources for
                different PIs go into the same bucket and the PIs
                compete against each other rather than against the
                wider community.  We've found that this ends up working
                pretty well, as most PIs use their purchased resources
                only sporadically.  Thus there are usually idle cores
                lying around that we backfill with our serial queues;
                since those jobs are requeueable, we can reclaim that
                idle space immediately when owners need it.  We are
                also toying with a high-priority partition open only to
                people with high fairshare, as they tend to be bursty
                users who benefit from immediate response.</p>
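              <p>(A rough slurm.conf sketch of that arrangement -
                partition names and limits here are invented for
                illustration:)</p>
              <pre>
# Requeue-based preemption lets backfill jobs yield immediately
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE

# Public partition: 7-day limit
PartitionName=general        Nodes=... MaxTime=7-00:00:00 PriorityTier=2
# Owned/department partition: no time limit
PartitionName=astro_dept     Nodes=... MaxTime=INFINITE   PriorityTier=2
# Serial backfill: lower tier, requeued when owners need the cores
PartitionName=serial_requeue Nodes=... MaxTime=7-00:00:00 PriorityTier=1
              </pre>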
              <p>Our current halflife is set to a month, and we keep 6
                months of data in our database.  I'd actually like to
                get rid of the halflife and just go to a 3-month moving
                window to allow people to bank their fairshare, but we
                haven't done that yet, as people have been having a
                hard enough time understanding our current system.
                That's not due to its complexity; it's more that most
                people just flat out aren't cognizant of their usage
                and think the resource is functionally infinite.</p>
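              <p>(In config terms, what's described above - note the
                moving window has no single knob in SLURM today:)</p>
              <pre>
# slurm.conf: the one-month halflife
PriorityDecayHalfLife=30-0

# slurmdbd.conf: keep roughly 6 months of job data
PurgeJobAfter=6months
PurgeStepAfter=6months
              </pre>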
              <p>-Paul Edmon-<br>
              </p>
              <div class="gmail-m_-2295429921239604436moz-cite-prefix">On
                6/19/19 5:16 PM, Fulcomer, Samuel wrote:<br>
              </div>
              <blockquote type="cite">
                <div dir="ltr">
                  <div dir="ltr">Hi Paul,
                    <div><br>
                    </div>
                    <div>Thanks. Your setup is interesting. I see that
                      you have your processor types segregated in their
                      own partitions (with the exception of the
                      requeue partition), and that's how you get at the
                      weighting mechanism. Do you have your users
                      explicitly specify multiple partitions in their
                      batch commands/scripts in order to take advantage
                      of this, or do you use a plugin for it?</div>
                    <div><br>
                    </div>
                    <div>It sounds like you don't impose any hard limit
                      on simultaneous resource use, and allow everything
                      to fairshare out with the help of the 7-day
                      TimeLimit. We haven't been imposing any TimeLimit
                      on our condo users, which would be an issue for us
                      with your config. For our exploratory and priority
                      users, we impose an effective time limit with
                      GrpTRESRunMins=cpu (and gres/gpu= for the GPU
                      usage). In addition, since we have so many
                      priority users, we don't explicitly set a rawshare
                      value for them (they all execute under the
                      "default" account). We set rawshare for the condo
                      accounts as cores-purchased/total-cores*1000. </div>
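                    <div>(In sacctmgr terms, roughly - QoS/account
                      names and limit values invented for
                      illustration:)</div>
                    <pre>
# Effective time limit for a priority user via a time-averaged cap
sacctmgr modify qos where name=priority_jsmith set \
    GrpTRES=cpu=192 GrpTRESRunMins=cpu=1000000,gres/gpu=20000

# Condo rawshare: cores-purchased/total-cores*1000,
# e.g. 512 of 12800 cores -> 0.04*1000 = 40
sacctmgr modify account where name=condo_doe set fairshare=40
                    </pre>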
                    <div><br>
                    </div>
                    <div>What's your fairshare decay setting
                      (PriorityDecayHalfLife, if I'm remembering the
                      name right)?</div>
                    <div><br>
                    </div>
                    <div>Regards,</div>
                    <div>Sam</div>
                    <div><br>
                    </div>
                    <div><br>
                    </div>
                  </div>
                  <br>
                  <div class="gmail_quote">
                    <div dir="ltr" class="gmail_attr">On Wed, Jun 19,
                      2019 at 3:44 PM Paul Edmon <<a
                        href="mailto:pedmon@cfa.harvard.edu"
                        target="_blank" moz-do-not-send="true">pedmon@cfa.harvard.edu</a>>
                      wrote:<br>
                    </div>
                    <blockquote class="gmail_quote" style="margin:0px
                      0px 0px 0.8ex;border-left:1px solid
                      rgb(204,204,204);padding-left:1ex">
                      <div bgcolor="#FFFFFF">
                        <p>We do a similar thing here at Harvard:</p>
                        <p><a
class="gmail-m_-2295429921239604436gmail-m_8457408054565706666moz-txt-link-freetext"
href="https://www.rc.fas.harvard.edu/fairshare/" target="_blank"
                            moz-do-not-send="true">https://www.rc.fas.harvard.edu/fairshare/</a></p>
                        <p>We simply weight all the partitions based on
                          their core type and then allocate Shares for
                          each account based on what they have
                          purchased.  We don't use QoS at all; we rely
                          purely on fairshare weighting to govern
                          resource usage.  It has worked pretty well
                          for our purposes.</p>
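                        <p>(One way such weighting can be wired up -
                          not necessarily our exact knobs - is
                          per-partition TRES billing weights plus
                          per-account Shares:)</p>
                        <pre>
# slurm.conf: fairshare usage charged by core type (weights illustrative)
PartitionName=general_sandy Nodes=... TRESBillingWeights="CPU=1.0"
PartitionName=general_sky   Nodes=... TRESBillingWeights="CPU=2.0"

# sacctmgr: Shares proportional to what each account purchased
sacctmgr modify account where name=doe_lab set fairshare=200
                        </pre>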
                        <p>-Paul Edmon-<br>
                        </p>
                        <div
class="gmail-m_-2295429921239604436gmail-m_8457408054565706666moz-cite-prefix">On
                          6/19/19 3:30 PM, Fulcomer, Samuel wrote:<br>
                        </div>
                        <blockquote type="cite">
                          <div dir="ltr"><br>
                            <div>(...and yes, the name is inspired by a
                              certain OEM's software licensing
                              schemes...)</div>
                            <div><br>
                            </div>
                            <div>At Brown we run a ~400-node cluster
                              containing nodes of multiple architectures
                              (Sandy/Ivy, Haswell/Broadwell, and
                              Sky/Cascade), purchased in some cases with
                              University funds and in others with
                              investigator funding (~50:50).  They all
                              appear in the default SLURM partition. We
                              have 3 classes of SLURM users:</div>
                            <div><br>
                            </div>
                            <div>
                              <ol>
                                <li>Exploratory - no-charge access to up
                                  to 16 cores</li>
                                <li>Priority - $750/quarter for access
                                  to up to 192 cores (and with a
                                  GrpTRESRunMins=cpu limit). Each user
                                  has their own QoS</li>
                                <li>Condo - an investigator group who
                                  paid for nodes added to the cluster.
                                  The group has its own QoS and SLURM
                                  Account. The QoS allows use of the
                                  number of cores purchased and has a
                                  much higher priority than the QoSes
                                  of the "priority" users. (All three
                                  classes are sketched just below.)</li>
                              </ol>
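                              <div>(A sketch of the three classes in
                                sacctmgr terms; names and limit values
                                are invented for illustration:)</div>
                              <pre>
# 1. Exploratory: shared no-charge QoS, capped at 16 cores per user
sacctmgr add qos exploratory MaxTRESPerUser=cpu=16

# 2. Priority: per-user QoS with a core cap and a time-averaged cap
sacctmgr add qos priority_jsmith GrpTRES=cpu=192 \
    GrpTRESRunMins=cpu=500000 Priority=10

# 3. Condo: group QoS sized to the purchased cores, much higher priority
sacctmgr add qos condo_doe GrpTRES=cpu=256 Priority=100
                              </pre>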
                              <div>The first problem with this scheme
                                is that condo users who purchased the
                                older hardware now have access to the
                                newest without penalty. In addition,
                                we're encountering resistance to the
                                idea of turning off their hardware and
                                terminating their condos (despite MOUs
                                stating a 5-year life). The pushback
                                rests on the stated belief that the
                                hardware should run until it dies.</div>
                            </div>
                            <div><br>
                            </div>
                            <div>What I propose is a new TRES called a
                              Processor Performance Unit (PPU) that
                              would be specified on the Node line in
                              slurm.conf and used such that a job's
                              charge against GrpTRES=ppu=N is
                              calculated as the number of allocated
                              cores multiplied by their associated PPU
                              values.</div>
                            <div><br>
                            </div>
                            <div>We could then assign a base PPU to the
                              oldest hardware, say "1" for Sandy/Ivy,
                              and increase it for later architectures
                              in proportion to their performance
                              improvement. We'd set the condo QoS to
                              GrpTRES=ppu=N*X+M*Y,..., where N is the
                              number of cores of the oldest
                              architecture and X its configured PPU per
                              core, with M*Y (and so on) repeating the
                              pattern for any newer nodes/cores the
                              investigator has purchased since.</div>
                            <div><br>
                            </div>
                            <div>The result is that the investigator
                              group gets to run on an approximation of
                              the performance that they've purchased,
                              rather than on the raw purchased core
                              count.</div>
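                            <div>(Illustratively - and to be clear,
                              none of the below exists in SLURM today;
                              this is the proposed syntax:)</div>
                            <pre>
# slurm.conf (proposed): PPU declared per node type
NodeName=sandy[001-064] CPUs=16 PPU=1.0
NodeName=sky[001-032]   CPUs=32 PPU=2.0

# A condo that bought 128 Sandy cores and 64 Skylake cores:
#   GrpTRES=ppu = 128*1.0 + 64*2.0 = 256
sacctmgr modify qos where name=condo_doe set GrpTRES=ppu=256
                            </pre>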
                            <div><br>
                            </div>
                            <div>Thoughts?</div>
                            <div><br>
                            </div>
                            <div><br>
                            </div>
                          </div>
                        </blockquote>
                      </div>
                    </blockquote>
                  </div>
                </div>
              </blockquote>
            </div>
          </blockquote>
        </div>
      </div>
    </blockquote>
  </body>
</html>