[slurm-users] Large job starvation on cloud cluster

Michael Gutteridge michael.gutteridge at gmail.com
Wed Feb 27 21:08:56 UTC 2019


Yes, we do have time limits set on partitions: 7 days maximum, 3 days
default.  In this case, the larger job is requesting 3 days of walltime,
and the smaller jobs are requesting 7.
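
For what it's worth, that setup looks roughly like the following (node list
omitted from the partition line, and the sbatch lines are only a sketch -
the script names are placeholders and the jobs may well use -n rather than
--cpus-per-task):

  # slurm.conf partition excerpt
  PartitionName=largenode Nodes=... MaxTime=7-00:00:00 DefaultTime=3-00:00:00

  # the larger job asks for 3 days of walltime...
  sbatch -p largenode --cpus-per-task=16 --time=3-00:00:00 big_job.sh
  # ...while the smaller jobs ask for the 7-day maximum
  sbatch -p largenode --cpus-per-task=6 --time=7-00:00:00 small_job.sh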

Thanks

M

On Wed, Feb 27, 2019 at 12:41 PM Andy Riebs <andy.riebs at hpe.com> wrote:

> Michael, are you setting time limits for the jobs? That's a huge part of a
> scheduler's decision about whether another job can be run. For example, if
> a job is running with the Slurm default of "infinite," the scheduler will
> likely decide that jobs that will fit in the remaining nodes will be able
> to finish before the job that requires infinite time.
>
> Andy
>
> ------------------------------
> *From:* Michael Gutteridge <michael.gutteridge at gmail.com>
> *Sent:* Wednesday, February 27, 2019 3:29 PM
> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
> *Cc:*
> *Subject:* [slurm-users] Large job starvation on cloud cluster
> I've run into a problem with a cluster we've got at a cloud provider -
> hoping someone might have some advice.
>
> The problem is that I've got a circumstance where large jobs _never_
> start... or, more correctly, larger jobs don't start when there are
> many smaller jobs in the partition.  In this cluster, accounts are limited
> to 300 cores.  One user has submitted a couple thousand jobs that each use
> 6 cores.  These queue up, start nodes, and eventually all 300 cores in the
> limit are busy and the remaining jobs are held with "AssocGrpCpuLimit".
> All as expected.
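>
> (For reference, that cap is the association-level group CPU limit, set
> with something along these lines - the account name here is made up:
>
>   sacctmgr modify account cloud_acct set GrpTRES=cpu=300
>   # jobs held by the cap show up like this:
>   squeue -A cloud_acct -t PD -o "%.10i %.4C %.20r"
>
> and the held jobs report "AssocGrpCpuLimit" as the reason.)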
>
> Then another user submits a job requesting 16 cores.  This one, too, gets
> held with the same reason.  However, that larger job never starts, even
> when it has the highest priority of the jobs in this account (I've set it
> manually and by using nice).
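>
> ("Set it manually and by using nice" means roughly the following - the
> job ID is the one from the log below and the values are arbitrary; the
> negative nice needs operator/admin privileges:
>
>   scontrol update JobId=2210784 Priority=1000000
>   scontrol update JobId=2210784 Nice=-10000
> )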
>
> What I see in the sched.log is:
>
> sched: [2019-02-25T16:00:14.940] Running job scheduler
> sched: [2019-02-25T16:00:14.941] JobId=2210784 delayed for accounting
> policy
> sched: [2019-02-25T16:00:14.942] JobId=2203130 initiated
> sched: [2019-02-25T16:00:14.942] Allocate JobId=2203130 NodeList=node1
> #CPUs=6 Partition=largenode
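>
> (That excerpt comes from scheduler logging enabled in slurm.conf, roughly
> as below; the log path is just an example.)
>
>   SlurmSchedLogFile=/var/log/slurm/sched.log
>   SlurmSchedLogLevel=1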
>
> In this case, 2210784 is the job requesting 16 cores and 2203130 is one of
> the six-core jobs.  This seems to happen with either the backfill or
> builtin scheduler.  I suspect what's happening is that when one of the
> smaller jobs completes, the scheduler first looks at the higher-priority
> large job, determines that it cannot run because of the constraint, looks
> at the next job in the list, determines that it can run without exceeding
> the limit, and then starts that job.  In this way, the larger job isn't
> started until all of these smaller jobs complete.
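>
> (The pattern is easy to watch with something like the squeue line below:
> the 16-core job sits at the top priority with reason "AssocGrpCpuLimit"
> while lower-priority 6-core jobs behind it keep starting.)
>
>   squeue -p largenode -t PD,R -o "%.10i %.4C %.10Q %.10T %.20r"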
>
> I thought that switching to the builtin scheduler would fix this, but as
> slurm.conf(5) indicates:
>
> > An exception is made for jobs that can not run due
> > to partition constraints (e.g. the time limit) or
> > down/drained nodes.  In that case, lower priority
> > jobs can be initiated and not impact the higher
> > priority job.
>
> I suspect one of these exceptions is being triggered: the limit is in the
> job's association, so I don't think it's a partition constraint.  I don't
> have this problem with the on-premises cluster, so I suspect it's something
> to do with power management and the state of powered-down nodes.
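>
> (The powered-down cloud nodes are the ones sinfo shows with a "~" suffix
> on the state, e.g. "idle~" - visible with something like the following.)
>
>   sinfo -p largenode -o "%.24N %.10T"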
>
> I've sort of worked around this by setting a per-user limit lower than the
> per-account limit, but that doesn't address a situation where a single
> user submits both large and small jobs, and it leads to some other problems
> for other groups, so it's not a long-term solution.
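>
> (That workaround is essentially the line below - the account name and the
> 150-CPU figure are only illustrative; every user association under the
> account gets a cap below the account-wide 300:
>
>   sacctmgr modify user where account=cloud_acct set GrpTRES=cpu=150
> )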
>
> Thanks for having a look
>
>  - Michael
>