<div dir="ltr"><div dir="ltr"><div dir="ltr"><div class="gmail_default" style="font-family:monospace">It's the association (account) limit.  The problem was that lower-priority jobs kept starting around this larger job, effectively backfilling even with the builtin scheduler, which prevented it from running.</div><div class="gmail_default" style="font-family:monospace"><br></div><div class="gmail_default" style="font-family:monospace">I have found what looks like the solution.  I needed to switch to the builtin scheduler and add "assoc_limit_stop" to "SchedulerParameters".  From slurm.conf(5):</div><div class="gmail_default" style="font-family:monospace"><br></div><div class="gmail_default" style="font-family:monospace"><div class="gmail_default">assoc_limit_stop</div><div class="gmail_default">     If  set  and  a  job cannot start due to association limits,</div><div class="gmail_default">     then do not attempt to initiate any lower priority  jobs  in</div><div class="gmail_default">     that  partition. Setting this can decrease system throughput</div><div class="gmail_default">     and utilization, but avoid potentially starving larger  jobs</div><div class="gmail_default">     by preventing them from launching indefinitely.</div><div class="gmail_default"><br></div><div class="gmail_default">I've made those changes, and now the lower-priority jobs wait for the larger, higher-priority job.  I must have looked past that section of the manpage a dozen times 8-/ before making the connection.  It doesn't seem to fix the problem when I use the backfill scheduler, but that may be due to the runtimes of the various jobs. 
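<div class="gmail_default"><br></div><div class="gmail_default">For anyone who finds this thread later, the change amounts to roughly the following two lines in slurm.conf.  This is only a sketch: it assumes no other SchedulerParameters options were already set (any existing ones would need to stay in the same comma-separated list), and per slurm.conf(5) slurmctld has to be restarted for a change of scheduler type to take effect.</div><pre># slurm.conf (fragment)
SchedulerType=sched/builtin
SchedulerParameters=assoc_limit_stop</pre>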
</div><div class="gmail_default"><br></div><div class="gmail_default">For us, switching to builtin actually makes more sense for our cloud cluster setup, so no problems making that change.</div><div class="gmail_default"><br></div><div class="gmail_default">Thanks to all for your time looking at the problem.</div><div class="gmail_default"><br></div><div class="gmail_default">Best</div><div class="gmail_default"><br></div><div class="gmail_default">Michael</div><div class="gmail_default"><br></div></div></div></div><br><div class="gmail_quote" style=""><div dir="ltr" class="gmail_attr">On Thu, Feb 28, 2019 at 7:54 AM Chris Samuel <<a href="mailto:chris@csamuel.org">chris@csamuel.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 28/2/19 7:29 am, Michael Gutteridge wrote:<br>

<br>

> 2221670 largenode sleeper.       me PD                 N/A      1 <br>

> (null)               (AssocGrpCpuLimit)<br>

<br>

That says the job exceeds some policy limit you have set and so is not <br>

permitted to start, looks like you've got a limit on the number of cores <br>

that an association has in the hierarchy either at or above that level <br>

that this would exceed.<br>

<br>

You'll probably need to go poking around with sacctmgr to see what that <br>

limit might be.<br>

<br>

All the best,<br>

Chris<br>

-- <br>

  Chris Samuel  :  <a href="http://www.csamuel.org/" rel="noreferrer" target="_blank">http://www.csamuel.org/</a>  :  Berkeley, CA, USA<br>

<br>

</blockquote></div></div>