<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <p>Users either specify the one partition they need or, if they are
      happy to run in any of several, they list them all like this:<br>

    </p>

    <p>#SBATCH -p general,shared,serial_requeue</p>

    <p>The scheduler then runs the job in whichever of the listed
      partitions can start it first.  Naturally there is a risk that you
      will end up running in a more expensive partition.</p>

    <p>Our time limit is only applied to our public partitions; our
      owned partitions (of which we have roughly 80) have no time
      limit.  So users who run on their dedicated resources have no
      penalty.  We've been working on getting rid of owned partitions
      and moving to a school/department based partition, where all the
      purchased resources for different PIs go into the same bucket
      and they compete against each other rather than the wider
      community.  We've found that this ends up working pretty well, as
      most PIs only use their purchased resources sporadically.  Thus
      there are usually idle cores lying around that we backfill with
      our serial queues.  Since those jobs are requeueable, owners still
      get an immediate response when they need that idle space.  We are
      also toying with a high priority partition that is open to people
      with high fairshare so that they can get an immediate response, as
      those with high fairshare tend to be bursty users.</p>

    <p>Our current halflife is set to a month and we keep 6 months of

      data in our database.  I'd actually like to get rid of the

      halflife and just go to a 3 month moving window to allow people to

      bank their fairshare, but we haven't done that yet as people have

      been having a hard enough time understanding our current system. 

      It's not due to its complexity but more that most people just flat

      out aren't cognizant of their usage and think the resource is

      functionally infinite.</p>
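    <p>As a rough illustration of the difference between the two schemes,
      here is a minimal Python sketch.  It is not how Slurm computes
      usage internally; the 30-day half-life and 90-day window are
      approximations of "a month" and "3 months", and the function names
      and the usage history are made up for this example.</p>
    <pre>
```python
def halflife_weight(age_days, halflife_days=30.0):
    """Weight of usage that is age_days old under exponential decay."""
    return 0.5 ** (age_days / halflife_days)


def window_weight(age_days, window_days=90.0):
    """Weight under a hard moving window: full inside, zero outside."""
    return 1.0 if window_days >= age_days else 0.0


def effective_usage(daily_usage, weight):
    """daily_usage[i] is the raw usage recorded i days ago."""
    return sum(used * weight(age) for age, used in enumerate(daily_usage))


# Hypothetical history: 100 core-hours burned 45 days ago, nothing since.
history = [0.0] * 45 + [100.0]

print(round(effective_usage(history, halflife_weight), 1))
print(effective_usage(history, window_weight))
```
    </pre>
    <p>Under these assumptions the old burst counts for roughly 35
      core-hours with the half-life, while the window still charges it
      in full until it ages out at day 90 and then drops it entirely.</p>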

    <p>-Paul Edmon-<br>

    </p>

    <div class="moz-cite-prefix">On 6/19/19 5:16 PM, Fulcomer, Samuel

      wrote:<br>

    </div>

    <blockquote type="cite"

cite="mid:CAOORAuGuxBy-uZEv2yE7EDj--hTj=sqmfh3w8ifP-xgtvbq-LA@mail.gmail.com">

      <meta http-equiv="content-type" content="text/html; charset=UTF-8">

      <div dir="ltr">

        <div dir="ltr">Hi Paul,

          <div><br>

          </div>

          <div>Thanks. Your setup is interesting. I see that you have
            your processor types segregated in their own partitions
            (with the exception of the requeue partition), and that's

            how you get at the weighting mechanism. Do you have your

            users explicitly specify multiple partitions in the batch

            commands/scripts in order to take advantage of this, or do

            you use a plugin for it?</div>

          <div><br>

          </div>

          <div>It sounds like you don't impose any hard limit on

            simultaneous resource use, and allow everything to fairshare

            out with the help of the 7 day TimeLimit. We haven't been

            imposing any TimeLimit on our condo users, which would be an

            issue for us with your config. For our exploratory and

            priority users, we impose an effective time limit with

            GrpTRESRunMins=cpu (and gres/gpu= for the GPU usage). In

            addition, since we have so many priority users, we don't

            explicitly set a rawshare value for them (they all execute

            under the "default" account). We set rawshare for the condo

            accounts as cores-purchased/total-cores*1000. </div>
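          <div>
            <p>The rawshare formula above can be sketched in a few lines
              of Python.  This is only an illustration: the function
              name, the rounding to an integer, and the example core
              counts are assumptions, not our actual tooling.</p>
            <pre>
```python
def condo_rawshare(cores_purchased, total_cores, scale=1000):
    """cores-purchased / total-cores * 1000, rounded to an integer.

    Rounding is an assumption of this sketch; the formula in the text
    does not say how fractional results are handled.
    """
    if not total_cores > 0:
        raise ValueError("total_cores must be positive")
    return round(cores_purchased * scale / total_cores)


# Hypothetical condo: 256 cores purchased on a 12800-core cluster.
print(condo_rawshare(256, 12800))
```
            </pre>
          </div>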

          <div><br>

          </div>

          <div>What's your fairshare decay setting (don't remember the

            proper name at the moment)?</div>

          <div><br>

          </div>

          <div>Regards,</div>

          <div>Sam</div>

          <div><br>

          </div>

          <div><br>

          </div>

        </div>

        <br>

        <div class="gmail_quote">

          <div dir="ltr" class="gmail_attr">On Wed, Jun 19, 2019 at 3:44

            PM Paul Edmon &lt;<a href="mailto:pedmon@cfa.harvard.edu"
              moz-do-not-send="true">pedmon@cfa.harvard.edu</a>&gt;

            wrote:<br>

          </div>

          <blockquote class="gmail_quote" style="margin:0px 0px 0px

            0.8ex;border-left:1px solid

            rgb(204,204,204);padding-left:1ex">

            <div bgcolor="#FFFFFF">

              <p>We do a similar thing here at Harvard:</p>

              <p><a

                  class="gmail-m_8457408054565706666moz-txt-link-freetext"

                  href="https://www.rc.fas.harvard.edu/fairshare/"

                  target="_blank" moz-do-not-send="true">https://www.rc.fas.harvard.edu/fairshare/</a></p>

              <p>We simply weight all the partitions based on their core

                type and then we allocate Shares for each account based

                on what they have purchased.  We don't use QoS at all,

                so we just rely purely on fairshare weighting for

                resource usage.  It has worked pretty well for our

                purposes.</p>

              <p>-Paul Edmon-<br>

              </p>

              <div class="gmail-m_8457408054565706666moz-cite-prefix">On

                6/19/19 3:30 PM, Fulcomer, Samuel wrote:<br>

              </div>

              <blockquote type="cite">

                <div dir="ltr"><br>

                  <div>(...and yes, the name is inspired by a certain

                    OEM's software licensing schemes...)</div>

                  <div><br>

                  </div>

                  <div>At Brown we run a ~400 node cluster containing

                    nodes of multiple architectures (Sandy/Ivy,

                    Haswell/Broadwell, and Sky/Cascade) purchased in

                    some cases by University funds and in others by

                    investigator funding (~50:50).  They all appear in

                    the default SLURM partition. We have 3 classes of

                    SLURM users:</div>

                  <div><br>

                  </div>

                  <div>

                    <ol>

                      <li>Exploratory - no-charge access to up to 16

                        cores</li>

                      <li>Priority - $750/quarter for access to up to

                        192 cores (and with a GrpTRESRunMins=cpu limit).

                        Each user has their own QoS</li>

                      <li>Condo - an investigator group who paid for

                        nodes added to the cluster. The group has its

                        own QoS and SLURM Account. The QoS allows use of

                        the number of cores purchased and has a much
                        higher priority than the QoSes of the "priority"
                        users.</li>

                    </ol>

                    <div>The first problem with this scheme is that

                      condo users who have purchased the older hardware

                      now have access to the newest without penalty. In

                      addition, we're encountering resistance to the

                      idea of turning off their hardware and terminating

                      their condos (despite MOUs stating a 5yr life).

                      The pushback is the stated belief that the

                      hardware should run until it dies.</div>

                  </div>

                  <div><br>

                  </div>

                  <div>What I propose is a new TRES called a Processor

                    Performance Unit (PPU) that would be specified on

                    the Node line in slurm.conf, and used such that

                    GrpTRES=ppu=N was calculated as the number of

                    allocated cores multiplied by their associated PPU

                    numbers.</div>

                  <div><br>

                  </div>

                  <div>We could then assign a base PPU to the oldest
                    hardware, say, "1" for Sandy/Ivy, and increase it for
                    later architectures based on performance
                    improvement. We'd set the condo QoS to
                    GrpTRES=ppu=N*X+M*Y,..., where N is the number of
                    cores of the oldest architecture and X is its
                    configured PPU/core, M and Y are the same for the
                    next architecture, and so on for any newer
                    nodes/cores the investigator has purchased since.</div>
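                  <div>
                    <p>To make the arithmetic concrete, here is a small
                      Python sketch of the proposed limit.  The "ppu"
                      TRES is only a proposal and does not exist in
                      SLURM today; the per-core factors other than the
                      base value of 1 for Sandy/Ivy, the architecture
                      keys, and the core counts are all hypothetical.</p>
                    <pre>
```python
# Hypothetical PPU-per-core factors. Only the base value of 1 for
# Sandy/Ivy comes from the proposal; the other numbers are made up.
PPU_PER_CORE = {
    "sandy_ivy": 1.0,
    "haswell_broadwell": 1.5,
    "sky_cascade": 2.0,
}


def ppu_total(cores_by_arch):
    """Sum of cores * PPU/core over architectures (N*X + M*Y + ...)."""
    return sum(cores * PPU_PER_CORE[arch]
               for arch, cores in cores_by_arch.items())


def fits_within_limit(allocated, purchased):
    """True if the allocated cores stay within the purchased PPU total."""
    return ppu_total(purchased) >= ppu_total(allocated)


# Hypothetical condo: 64 Sandy/Ivy cores plus 32 Sky/Cascade cores.
purchased = {"sandy_ivy": 64, "sky_cascade": 32}

print(f"GrpTRES=ppu={ppu_total(purchased):g}")
print(fits_within_limit({"sky_cascade": 64}, purchased))
print(fits_within_limit({"sky_cascade": 96}, purchased))
```
                    </pre>
                    <p>With these made-up factors the group's limit is
                      128 PPU, so it could occupy at most 64 of the
                      newest cores at once instead of all 96 cores it
                      would get under a raw core count.</p>
                  </div>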

                  <div><br>

                  </div>

                  <div>The result is that the investigator group gets to

                    run on an approximation of the performance that

                    they've purchased, rather than on the raw purchased
                    core count.</div>

                  <div><br>

                  </div>

                  <div>Thoughts?</div>

                  <div><br>

                  </div>

                  <div><br>

                  </div>

                </div>

              </blockquote>

            </div>

          </blockquote>

        </div>

      </div>

    </blockquote>

  </body>

</html>