[slurm-users] Large job starvation on cloud cluster
michael.gutteridge at gmail.com
Thu Feb 28 18:15:49 UTC 2019
It's the association (account) limit. The problem was that lower-priority
jobs were backfilling (even with the builtin scheduler) around this larger
job, preventing it from ever running.
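To make the starvation mechanism concrete, here's a toy time-stepped model (my own sketch, not Slurm code): one account with a 4-CPU group limit runs a stream of low-priority 1-CPU jobs plus one high-priority 4-CPU job. If the scheduler keeps starting small jobs while the big job is blocked by the limit, the account's usage never drains to zero and the big job waits forever; stopping at the blocked job lets it start.

```python
def simulate(assoc_limit_stop, grp_cpus=4, steps=20):
    """Toy scheduler loop, illustrative only -- not Slurm's algorithm.

    Returns the step at which the 4-CPU high-priority job started,
    or None if it never did.
    """
    # Four running 1-CPU small jobs, as [cpus, steps_left], staggered endings
    running = [[1, 1], [1, 2], [1, 1], [1, 2]]
    big_started_at = None
    for t in range(steps):
        # Advance time: running jobs consume one step; finished jobs free CPUs
        for job in running:
            job[1] -= 1
        running = [j for j in running if j[1] > 0]
        used = sum(cpus for cpus, _ in running)
        # Consider the highest-priority job first: the 4-CPU job
        if big_started_at is None:
            if used + 4 <= grp_cpus:
                big_started_at = t
                running.append([4, 5])
                used += 4
            elif assoc_limit_stop:
                # Blocked by the association limit: start nothing below it
                continue
        # Otherwise, lower-priority 1-CPU jobs refill the association's limit
        while used + 1 <= grp_cpus:
            running.append([1, 2])
            used += 1
    return big_started_at
```

Without `assoc_limit_stop` the small jobs perpetually hold 2 of the 4 CPUs at every scheduling pass, so the big job never fits under the limit; with it, the small jobs drain and the big job starts.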
I have found what looks like the solution: switch to the builtin
scheduler and add "assoc_limit_stop" to "SchedulerParameters". From the
slurm.conf manpage:
If set and a job cannot start due to association limits,
then do not attempt to initiate any lower priority jobs in
that partition. Setting this can decrease system throughput
and utilization, but avoid potentially starving larger jobs
by preventing them from launching indefinitely.
I've made those changes, and now only the lower-priority jobs wait for the
larger, higher-priority job. I must have looked past that section of the
manpage a dozen times 8-/ before making the connection. It doesn't seem to
fix this when I use the backfill scheduler, but that may be due to the
runtimes of the various jobs.
For us, switching to builtin actually makes more sense for our cloud
cluster setup, so no problems making that change.
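For anyone following along, the combined change in slurm.conf would look something like this (a sketch of the two settings discussed above; any other SchedulerParameters values you already have should be kept in the comma-separated list):

```
SchedulerType=sched/builtin
SchedulerParameters=assoc_limit_stop
```

Both require a `scontrol reconfigure` (or a restart of slurmctld) to take effect.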
Thanks to all for your time looking at the problem.
On Thu, Feb 28, 2019 at 7:54 AM Chris Samuel <chris at csamuel.org> wrote:
> On 28/2/19 7:29 am, Michael Gutteridge wrote:
> >   2221670 largenode  sleeper.   me  PD  N/A  1  (null)  (AssocGrpCpuLimit)
> That says the job exceeds some policy limit you have set and so is not
> permitted to start. It looks like you've got a limit on the number of
> cores for an association, either at or above that level of the hierarchy,
> that this job would exceed.
> You'll probably need to go poking around with sacctmgr to see what that
> limit might be.
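> For example, something along these lines will show the association tree
> and its group TRES limits (field names vary a little between Slurm
> versions, so adjust the format list as needed):
>
> ```
> sacctmgr show assoc where user=$USER format=Cluster,Account,User,GrpTRES
> ```
>
> A GrpTRES value like "cpu=100" on the user's account (or a parent
> account) would produce the AssocGrpCpuLimit reason above.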
> All the best,
> Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA