<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body smarttemplateinserted="true">
    Michael, are you setting time limits for the jobs? That's a huge
    part of a scheduler's decision about whether another job can be run.
    For example, if a job is running with the Slurm default time limit
    of "infinite," the scheduler will likely conclude that any job that
    fits on the remaining nodes can finish before the job that
    requires infinite time.<br>
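    <br>
    A toy model of that reasoning, as a rough sketch only: this is not
    Slurm's actual backfill code, and the function names
    (expected_start, can_backfill) and the minute-based inputs are
    made up for illustration. It assumes the scheduler estimates when a
    blocked job could start from the time limits of the running jobs,
    and lets a lower-priority job run if its own limit ends by then.<br>
    <pre>
```python
import math


def expected_start(needed_cores, idle_cores, running):
    """Estimate minutes until a blocked job could start.

    running is a list of (minutes_until_time_limit, cores) tuples, one per
    running job. The estimate assumes every job runs to its time limit.
    """
    free = idle_cores
    if free >= needed_cores:
        return 0
    # Release running jobs in order of when their time limits expire.
    for remaining, cores in sorted(running):
        free += cores
        if free >= needed_cores:
            return remaining
    return math.inf


def can_backfill(candidate_limit, blocked_job_start):
    """A lower-priority job may run if its time limit ends no later than
    the estimated start of the blocked higher-priority job."""
    return candidate_limit <= blocked_job_start


# Running jobs with finite limits: the blocked job has a real start estimate.
finite = expected_start(16, 4, [(30, 6), (60, 6)])
# Running jobs with no limit: the estimate is "never".
infinite = expected_start(16, 4, [(math.inf, 6), (math.inf, 6)])

print(finite, can_backfill(120, finite))
print(infinite, can_backfill(120, infinite))
```
    </pre>
    In this model, once the running jobs have no time limit, every
    candidate passes the check, which is the effect described above.<br>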
    <br>
    Andy<br>
    <br>
    <div id="smartTemplate4-quoteHeader">
      <hr> <b>From:</b> Michael Gutteridge
      <a class="moz-txt-link-rfc2396E" href="mailto:michael.gutteridge@gmail.com">&lt;michael.gutteridge@gmail.com&gt;</a> <br>
      <b>Sent:</b> Wednesday, February 27, 2019 3:29PM <br>
      <b>To:</b> Slurm User Community List
      <a class="moz-txt-link-rfc2396E" href="mailto:slurm-users@lists.schedmd.com">&lt;slurm-users@lists.schedmd.com&gt;</a><br>
      <b>Cc:</b> <br>
      <b>Subject:</b> [slurm-users] Large job starvation on cloud
      cluster <br>
    </div>
    <div class="replaced-blockquote"
cite="mid:CALUL84tTf_Nq8V8a=LZXnXpb2iWrwFmi650fOX62ZLW2w1cTow@mail.gmail.com"
      type="cite">
      <div dir="ltr">
        <div dir="ltr">
          <div dir="ltr">
            <div dir="ltr">
              <div class="gmail_default" style="font-family:monospace">I've
                run into a problem with a cluster we've got in a cloud
                provider, and I'm hoping someone might have some advice.</div>
              <div class="gmail_default" style="font-family:monospace"><br>
              </div>
              <div class="gmail_default" style="font-family:monospace">The
                problem is that large jobs _never_ start... or, more
                correctly, that larger jobs don't start when there are
                many smaller jobs in the partition.  In this cluster,
                accounts are limited to 300 cores.  One user has
                submitted a couple thousand jobs that each use 6 cores. 
                These queue up, start nodes, and eventually all 300
                cores in the limit are busy and the remaining jobs are
                held with "AssocGrpCpuLimit".  All as expected.</div>
              <div class="gmail_default" style="font-family:monospace"><br>
              </div>
              <div class="gmail_default" style="font-family:monospace">Then
                another user submits a job requesting 16 cores.  This
                one, too, gets held with the same reason.  However, that
                larger job never starts, even if it has the highest
                priority of any job in this account (I've set it
                manually and by using nice).</div>
              <div class="gmail_default" style="font-family:monospace"><br>
              </div>
              <div class="gmail_default" style="font-family:monospace">What
                I see in the sched.log is:</div>
              <div class="gmail_default" style="font-family:monospace"><br>
              </div>
              <div class="gmail_default">
                <div class="gmail_default" style="font-family:monospace">sched:
                  [2019-02-25T16:00:14.940] Running job scheduler</div>
                <div class="gmail_default" style="font-family:monospace">sched:
                  [2019-02-25T16:00:14.941] JobId=2210784 delayed for
                  accounting policy</div>
                <div class="gmail_default" style="font-family:monospace">sched:
                  [2019-02-25T16:00:14.942] JobId=2203130 initiated</div>
                <div class="gmail_default" style="font-family:monospace">sched:
                  [2019-02-25T16:00:14.942] Allocate JobId=2203130
                  NodeList=node1 #CPUs=6 Partition=largenode</div>
                <div style="font-family:monospace"><br>
                </div>
                <div style="font-family:monospace">In this case, 2210784
                  is the job requesting 16 cores and 2203130 is one of
                  the six-core jobs.  This seems to happen with either
                  the backfill or the builtin scheduler.  I suspect
                  what's happening is that when one of the smaller jobs
                  completes, the scheduler first looks at the
                  higher-priority large job, determines that it cannot
                  run because of the constraint, looks at the next job
                  in the list, determines that it can run without
                  exceeding the limit, and then starts that job.  In
                  this way, the larger job isn't started until all of
                  these smaller jobs complete.</div>
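                <div style="font-family:monospace"><br>
                </div>
                <div style="font-family:monospace">The suspected
                  behavior can be reproduced in a small simulation. This
                  is only a sketch of the hypothesis, not Slurm's real
                  scheduling logic: the Job class, greedy_pass, and
                  simulate are hypothetical names, and it assumes one
                  six-core job finishes between scheduling passes.</div>
                <pre>
```python
from dataclasses import dataclass

LIMIT = 300  # per-account core limit from the description above


@dataclass
class Job:
    job_id: str
    cores: int
    priority: int


def greedy_pass(pending, used_cores, limit):
    """One scheduling pass: walk jobs from highest to lowest priority and
    start each one that fits under the limit. A job that does not fit is
    skipped, so it never blocks the lower-priority jobs behind it."""
    started = []
    for job in sorted(pending, key=lambda j: -j.priority):
        if used_cores + job.cores <= limit:
            used_cores += job.cores
            started.append(job)
    return started, used_cores


def simulate(num_small_pending):
    """Count how many six-core jobs must finish before the 16-core job
    starts, with the account limit initially full of six-core jobs."""
    used = (LIMIT // 6) * 6  # 50 running six-core jobs fill all 300 cores
    pending = [Job("big", 16, 100)]
    pending += [Job(f"small{i}", 6, 1) for i in range(num_small_pending)]
    completions = 0
    while any(j.job_id == "big" for j in pending):
        used -= 6  # one running six-core job completes
        completions += 1
        started, used = greedy_pass(pending, used, LIMIT)
        for job in started:
            pending.remove(job)
    return completions


for queued in (0, 10, 2000):
    print(queued, "small jobs queued: big job starts after",
          simulate(queued), "completions")
```
                </pre>
                <div style="font-family:monospace">In this model the
                  16-core job starts only after every queued six-core
                  job has been started, regardless of its priority.</div>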
                <div style="font-family:monospace"><br>
                </div>
                <div style="font-family:monospace">I thought that
                  switching to the builtin scheduler would fix this, but
                  as slurm.conf(5) indicates:</div>
                <div style="font-family:monospace"><br>
                </div>
                <div>
                  <div><span style="font-family:monospace">&gt; An
                      exception is made for jobs that can not run due </span></div>
                  <div><span style="font-family:monospace">&gt; to
                      partition constraints (e.g. the time limit) or </span></div>
                  <div><span style="font-family:monospace">&gt;
                      down/drained nodes.  In that case, lower priority </span></div>
                  <div><span style="font-family:monospace">&gt; jobs can
                      be initiated and not impact the higher </span></div>
                  <div><span style="font-family:monospace">&gt; priority
                      job.</span></div>
                </div>
                <div style="font-family:monospace"><br>
                </div>
                <div style="font-family:monospace">I suspect one of
                  these exceptions is being triggered.  The limit is in
                  the job's association, so I don't think it's a
                  partition constraint.  I don't have this problem with
                  the on-premises cluster, so I suspect it has something
                  to do with power management and the state of
                  powered-down nodes.</div>
                <div style="font-family:monospace"><br>
                </div>
                <div style="font-family:monospace">I've sort of worked
                  around this by setting a per-user limit lower than the
                  per-account limit, but that doesn't help when a single
                  user submits both large and small jobs, and it causes
                  other problems in other groups, so it's not a
                  long-term solution.</div>
                <div style="font-family:monospace"><br>
                </div>
                <div style="font-family:monospace">Thanks for having a
                  look</div>
                <div style="font-family:monospace"><br>
                </div>
                <div style="font-family:monospace"> - Michael</div>
                <div style="font-family:monospace"><br>
                </div>
              </div>
              </div>
          </div>
        </div>
      </div>
    </div>
    <br>
  </body>
</html>