<div dir="ltr"><div dir="ltr"><div class="gmail_default" style="font-family:monospace">Yes, we do have time limits set on partitions- 7 days maximum, 3 days default.  In this case, the larger job is requesting 3 days of walltime, the smaller jobs are requesting 7.</div><div class="gmail_default" style="font-family:monospace"><br></div><div class="gmail_default" style="font-family:monospace">Thanks</div><div class="gmail_default" style="font-family:monospace"><br></div><div class="gmail_default" style="font-family:monospace">M</div><input name="virtru-metadata" type="hidden" value="{"email-policy":{"state":"closed","expirationUnit":"days","disableCopyPaste":false,"disablePrint":false,"disableForwarding":false,"enableNoauth":false,"expires":false,"isManaged":false},"attachments":{},"compose-id":"7","compose-window":{"secure":false}}"></div><br><div class="gmail_quote" style=""><div dir="ltr" class="gmail_attr">On Wed, Feb 27, 2019 at 12:41 PM Andy Riebs <<a href="mailto:andy.riebs@hpe.com">andy.riebs@hpe.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
  
    
  
Thanks

M

On Wed, Feb 27, 2019 at 12:41 PM Andy Riebs <andy.riebs@hpe.com> wrote:

Michael, are you setting time limits for the jobs? That's a huge part of a scheduler's decision about whether another job can be run. For example, if a job is running with the Slurm default of "infinite," the scheduler will likely decide that jobs that will fit in the remaining nodes will be able to finish before the job that requires infinite time.

Andy
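
A quick way to see what time limit each pending job has actually requested (the format string is just an example; the job ID below is the one from the log further down):

    # job ID, state, requested time limit, and partition for pending jobs
    squeue --states=PD -o "%.10i %.8T %.12l %.12P"

    # the limit on an already-queued job can be changed with scontrol, e.g.
    scontrol update JobId=2210784 TimeLimit=3-00:00:00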
    <div id="gmail-m_1191102331998785307smartTemplate4-quoteHeader">
      <hr> <b>From:</b> Michael Gutteridge
      <a class="gmail-m_1191102331998785307moz-txt-link-rfc2396E" href="mailto:michael.gutteridge@gmail.com" target="_blank"><michael.gutteridge@gmail.com></a> <br>
      <b>Sent:</b> Wednesday, February 27, 2019 3:29PM <br>
      <b>To:</b> Slurm User Community List
      <a class="gmail-m_1191102331998785307moz-txt-link-rfc2396E" href="mailto:slurm-users@lists.schedmd.com" target="_blank"><slurm-users@lists.schedmd.com></a><br>
      <b>Cc:</b> <br>
      <b>Subject:</b> [slurm-users] Large job starvation on cloud
      cluster <br>
    </div>
    <div class="gmail-m_1191102331998785307replaced-blockquote" type="cite">
      
      <div dir="ltr">
        <div dir="ltr">
          <div dir="ltr">
            <div dir="ltr">
              <div style="font-family:monospace">I've
                run into a problem with a cluster we've got in a cloud
                provider- hoping someone might have some advice.</div>
              <div style="font-family:monospace"><br>
              </div>
              <div style="font-family:monospace">The
                problem is that I've got a circumstance where large jobs
                _never_ start... or more correctly, that large-er jobs
                don't start when there are many smaller jobs in the
                partition.  In this cluster, accounts are limited to 300
                cores.  One user has submitted a couple thousand jobs
                that each use 6 cores.  These queue up, start nodes, and
                eventually all 300 cores in the limit are busy and the
                remaining jobs are held with "AssocGrpCpuLimit".  All as
                expected.</div>
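
The 300-core cap is an association (accounting) limit; that kind of limit is typically set and checked with sacctmgr, roughly like this (the account name is a stand-in):

    # group CPU limit of 300 cores on the account
    sacctmgr modify account somelab set GrpTRES=cpu=300

    # show the resulting association limits
    sacctmgr show assoc format=account,user,grptres,grpjobs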
              <div style="font-family:monospace"><br>
              </div>
              <div style="font-family:monospace">Then
                another user submits a job requesting 16 cores.  This
                one, too, gets held with the same reason.  However, that
                larger job never starts even if it has the highest
                priority of jobs in this account (I've set it manually
                and by using nice).</div>
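
To be concrete, I bumped its priority roughly like this; the numeric values are only illustrative:

    # raise the job's priority directly (needs admin rights)
    scontrol update JobId=2210784 Priority=1000000

    # or lower its nice value (negative values also need admin rights)
    scontrol update JobId=2210784 Nice=-1000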
              <div style="font-family:monospace"><br>
              </div>
              <div style="font-family:monospace">What
                I see in the sched.log is:</div>
              <div style="font-family:monospace"><br>
              </div>
              <div>
                <div style="font-family:monospace">sched:
                  [2019-02-25T16:00:14.940] Running job scheduler</div>
                <div style="font-family:monospace">sched:
                  [2019-02-25T16:00:14.941] JobId=2210784 delayed for
                  accounting policy</div>
                <div style="font-family:monospace">sched:
                  [2019-02-25T16:00:14.942] JobId=2203130 initiated</div>
                <div style="font-family:monospace">sched:
                  [2019-02-25T16:00:14.942] Allocate JobId=2203130
                  NodeList=node1 #CPUs=6 Partition=largenode</div>
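
That output is from the separate scheduler log, enabled in slurm.conf with something like the following (the path is just an example):

    SlurmSchedLogFile=/var/log/slurm/sched.log
    SlurmSchedLogLevel=1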
                <div style="font-family:monospace"><br>
                </div>
                <div style="font-family:monospace">In this case, 2210784
                  is the job requesting 16 cores and 2203130 is one of
                  the six core jobs.  This seems to happen with either
                  the backfill or builtin scheduler.  I suspect what's
                  happening is that when one of the smaller jobs
                  completes, the scheduler first looks at the
                  higher-priority large job, determines that it cannot
                  run because of the constraint, looks at the next job
                  in the list, determines that it can run without
                  exceeding the limit, and then starts that job.  In
                  this way, the larger job isn't started until all of
                  these smaller jobs complete.</div>
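
If that reading is right, this may be the behaviour that the assoc_limit_stop scheduler option is intended to address. I haven't verified that it helps in this setup, but on a recent enough Slurm it would be something like:

    # append to any existing SchedulerParameters list: when a job is blocked by an
    # association limit, don't start lower-priority jobs in that partition ahead of it
    SchedulerParameters=assoc_limit_stop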
                <div style="font-family:monospace"><br>
                </div>
                <div style="font-family:monospace">I thought that
                  switching to the builtin scheduler would fix this, but
                  as slurm.conf(5) indicates:</div>
                <div style="font-family:monospace"><br>
                </div>
                <div>
                  <div><font face="monospace">> An </font><span style="font-family:monospace">exception is made
                      for jobs that can not run due </span></div>
                  <div><span style="font-family:monospace">> to
                      partition </span><span style="font-family:monospace">constraints (e.g.
                      the time limit) or </span></div>
                  <div><span style="font-family:monospace">>
                      down/drained nodes.  In </span><span style="font-family:monospace">that case, lower
                      priority </span></div>
                  <div><span style="font-family:monospace">> jobs can
                      be initiated and not </span><span style="font-family:monospace">impact the higher </span></div>
                  <div><span style="font-family:monospace">> priority
                      job.</span></div>
                </div>
                <div style="font-family:monospace"><br>
                </div>
                <div style="font-family:monospace">I suspect one of
                  these exceptions is being triggered- the limit is in
                  the job's association, so I don't think it's a
                  partition constraint.  I don't have this problem with
                  the on-premises cluster, so I suspect it's something
                  to do with power management and the state of
                  powered-down nodes.</div>
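
For context, nodes in this cluster are started and stopped through Slurm's power-saving hooks; the relevant slurm.conf pieces look roughly like this (programs, paths, and timings are placeholders, not our real values):

    SuspendProgram=/opt/slurm/bin/suspend-node.sh
    ResumeProgram=/opt/slurm/bin/resume-node.sh
    SuspendTime=600
    ResumeTimeout=900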
                <div style="font-family:monospace"><br>
                </div>
                <div style="font-family:monospace">I've sort-of worked
                  around this by setting a per-user limit lower than the
                  per-account limit, but that doesn't address any
                  situation where a single user submits large and small
                  jobs and does lead to some other problems in other
                  groups, so it's not a long-term solution.</div>
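
Concretely, the workaround is a per-user association cap along these lines (the user and account names and the value are examples only):

    # cap each user's association below the account's 300-core group limit
    sacctmgr modify user where name=auser account=somelab set GrpTRES=cpu=150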
                <div style="font-family:monospace"><br>
                </div>
                <div style="font-family:monospace">Thanks for having a
                  look</div>
                <div style="font-family:monospace"><br>
                </div>
                <div style="font-family:monospace"> - Michael</div>
                <div style="font-family:monospace"><br>
                </div>
              </div>
              </div>
          </div>
        </div>
      </div>
    </div>
    <br>
  </div>

</blockquote></div></div>