[slurm-users] Large job starvation on cloud cluster

Thu Feb 28 18:15:49 UTC 2019

It's the association (account) limit.  The problem was that lower-priority
jobs were backfilling around this larger job (even with the builtin
scheduler), preventing it from running.

I have found what looks like the solution.  I need to switch to the builtin
scheduler and add "assoc_limit_stop" to "SchedulerParameters".  From
slurm.conf(5):

assoc_limit_stop
     If  set  and  a  job cannot start due to association limits,
     then do not attempt to initiate any lower priority  jobs  in
     that  partition. Setting this can decrease system throughput
     and utilization, but avoid potentially starving larger  jobs
     by preventing them from launching indefinitely.
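
For anyone finding this later, a minimal sketch of the relevant slurm.conf
lines. This assumes no other SchedulerParameters options are already set; if
there are, assoc_limit_stop would be appended to the existing comma-separated
list rather than replacing it:

```conf
# Use the builtin (strict priority order) scheduler instead of backfill.
SchedulerType=sched/builtin
# Stop starting lower-priority jobs in a partition when a higher-priority
# job there is blocked by an association limit.
SchedulerParameters=assoc_limit_stop
```

A change of SchedulerType needs a slurmctld restart to take effect;
"scontrol show config | grep -i sched" should then show the new values.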

I've made those changes, and now the lower-priority jobs wait for the
larger, higher-priority job.  I must have looked past that section of the
manpage a dozen times 8-/ before making the connection.  It doesn't seem to
fix this when I use the backfill scheduler, but that may be due to the
runtimes on the various jobs.

For us, switching to builtin actually makes more sense for our cloud
cluster setup, so no problems making that change.

Thanks to all for your time looking at the problem.

Best

Michael

On Thu, Feb 28, 2019 at 7:54 AM Chris Samuel <chris at csamuel.org> wrote:

> On 28/2/19 7:29 am, Michael Gutteridge wrote:
>
> > 2221670 largenode sleeper.       me PD                 N/A      1 (null)               (AssocGrpCpuLimit)
>
> That says the job exceeds some policy limit you have set and so is not
> permitted to start.  It looks like you've got a limit on the number of
> cores that an association has in the hierarchy, either at or above that
> level, that this would exceed.
>
> You'll probably need to go poking around with sacctmgr to see what that
> limit might be.
>
> All the best,
> Chris
> --
>   Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
>
>