[slurm-users] Large job starvation on cloud cluster

Wed Feb 27 20:29:17 UTC 2019

I've run into a problem with a cluster we run at a cloud provider and am
hoping someone might have some advice.

The problem is that large jobs _never_ start, or more precisely, larger
jobs don't start while there are many smaller jobs in the partition.
In this cluster, accounts are limited
to 300 cores.  One user has submitted a couple thousand jobs that each use
6 cores.  These queue up, start nodes, and eventually all 300 cores in the
limit are busy and the remaining jobs are held with "AssocGrpCpuLimit".
All as expected.

Then another user submits a job requesting 16 cores.  This one, too, gets
held with the same reason.  However, that larger job never starts even if
it has the highest priority of jobs in this account (I've set it manually
and by using nice).

What I see in the sched.log is:

sched: [2019-02-25T16:00:14.940] Running job scheduler
sched: [2019-02-25T16:00:14.941] JobId=2210784 delayed for accounting policy
sched: [2019-02-25T16:00:14.942] JobId=2203130 initiated
sched: [2019-02-25T16:00:14.942] Allocate JobId=2203130 NodeList=node1 #CPUs=6 Partition=largenode

In this case, 2210784 is the job requesting 16 cores and 2203130 is one of
the six core jobs.  This seems to happen with either the backfill or
builtin scheduler.  I suspect what's happening is that when one of the
smaller jobs completes, the scheduler first looks at the higher-priority
large job, determines that it cannot run because of the constraint, looks
at the next job in the list, determines that it can run without exceeding
the limit, and then starts that job.  In this way, the larger job isn't
started until all of these smaller jobs complete.
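
To make the suspected behavior concrete, here is a toy model in Python. It is
only a sketch of my guess above, not Slurm's actual scheduling code: the names
(Job, schedule_pass, smalls_started_before_large) and the count of 200 queued
small jobs are made up for illustration, and it assumes exactly one 6-core job
finishes between scheduling passes.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    cpus: int
    priority: int


def schedule_pass(pending, cpus_in_use, limit, skip_blocked):
    """Walk the queue in priority order and start every job that fits.

    skip_blocked=True models the suspected behavior: a job that would
    exceed the limit is passed over and lower-priority jobs are still
    considered. skip_blocked=False stops at the first job that does not fit.
    """
    started = []
    for job in sorted(pending, key=lambda j: j.priority, reverse=True):
        if cpus_in_use + job.cpus <= limit:
            cpus_in_use += job.cpus
            started.append(job)
        elif not skip_blocked:
            break
    for job in started:
        pending.remove(job)
    return started, cpus_in_use


def smalls_started_before_large(skip_blocked, limit=300, small_cpus=6,
                                large_cpus=16, n_small=200):
    """Count small jobs started after the large job was queued but before it ran."""
    pending = [Job(f"small-{i}", small_cpus, 1) for i in range(n_small)]
    # Fill the account up to its core limit with small jobs.
    _, in_use = schedule_pass(pending, 0, limit, skip_blocked)

    # The large job arrives with the highest priority in the account.
    large = Job("large", large_cpus, 100)
    pending.append(large)

    count = 0
    while large in pending:
        in_use -= small_cpus  # one running small job completes
        started, in_use = schedule_pass(pending, in_use, limit, skip_blocked)
        count += sum(1 for j in started if j is not large)
    return count


if __name__ == "__main__":
    print("skipping blocked jobs:", smalls_started_before_large(True))
    print("holding for top job:  ", smalls_started_before_large(False))
```

In this model, with skipping enabled, every queued small job starts before the
large one, because each completion frees only 6 cores and the next 6-core job
immediately takes them. With skipping disabled, the large job starts as soon
as three small jobs have finished.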

I thought that switching to the builtin scheduler would fix this, but as
slurm.conf(5) indicates:

> An exception is made for jobs that can not run due
> to partition constraints (e.g. the time limit) or
> down/drained nodes.  In that case, lower priority
> jobs can be initiated and not impact the higher
> priority job.

I suspect one of these exceptions is being triggered.  The limit is in the
job's association, so I don't think it's a partition constraint.  I don't
have this problem with the on-premises cluster, so I suspect it's something
to do with power management and the state of powered-down nodes.

I've sort of worked around this by setting a per-user limit lower than the
per-account limit, but that doesn't help when a single user submits both
large and small jobs, and it causes other problems in other groups, so
it's not a long-term solution.
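
A small Python sketch of why the workaround only partly helps. The per-user
limit of 280 cores, the user names, and the fits() helper are assumptions for
illustration (only the 300-core account limit is real), and this is a
simplified check, not how Slurm evaluates association limits.

```python
def fits(job_cpus, user, usage, user_limit, account_limit):
    """Return True if the job stays within both the per-user and per-account limits."""
    user_ok = usage.get(user, 0) + job_cpus <= user_limit
    account_ok = sum(usage.values()) + job_cpus <= account_limit
    return user_ok and account_ok


# Hypothetical state: "alice" has filled her 280-core cap with 6-core jobs.
usage = {"alice": 276}

# Another user's 16-core job fits in the headroom the per-user cap leaves.
print(fits(16, "bob", usage, user_limit=280, account_limit=300))

# The same 16-core job submitted by alice is blocked by her own cap...
print(fits(16, "alice", usage, user_limit=280, account_limit=300))

# ...while one of her 6-core jobs fits again as soon as one completes.
print(fits(6, "alice", {"alice": 270}, user_limit=280, account_limit=300))
```

The per-user cap just moves the same starvation from the account level to the
user level.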

Thanks for having a look

 - Michael