[slurm-users] Longer queuing times for larger jobs

Loris Bennett loris.bennett at fu-berlin.de
Fri Jan 31 13:04:15 UTC 2020


Hi David,
David Baker <D.J.Baker at soton.ac.uk> writes:

> Hello,
>
> Our SLURM cluster is relatively small. We have 350 standard compute
> nodes each with 40 cores. The largest job that users can run on the
> partition is one requesting 32 nodes. Our cluster is a general
> university research resource and so there are many different sizes of
> jobs ranging from single core jobs, that get routed to a serial
> partition via the job_submit.lua, through to jobs requesting 32
> nodes. When we first started the service, 32 node jobs were typically
> taking in the region of 2 days to schedule -- recently queuing times
> have started to get out of hand. Our setup is essentially...
>
> PriorityFavorSmall=NO
> FairShareDampeningFactor=5
> PriorityFlags=ACCRUE_ALWAYS,FAIR_TREE
> PriorityType=priority/multifactor
> PriorityDecayHalfLife=7-0
>
> PriorityWeightAge=400000
> PriorityWeightPartition=1000
> PriorityWeightJobSize=500000
> PriorityWeightQOS=1000000
> PriorityMaxAge=7-0
>
> To try to reduce the queuing times for our bigger jobs should we
> potentially increase the PriorityWeightJobSize factor in the first
> instance to bump up the priority of such jobs? Or should we
> potentially define a set of QOSs which we assign to jobs in our
> job_submit.lua depending on the size of the job? In other words, let's
> say there is a large QOS that gives the largest jobs a higher priority,
> and also limits how many of those jobs a single user can submit.
>
> Your advice would be appreciated, please. At the moment these large
> jobs are not accruing a sufficiently high priority to rise above the
> other jobs in the cluster.

We have always gone for the weighting approach, rather than the QOS
routing one.  I have always thought that QOS routing potentially takes
away some of the users' freedom unnecessarily.  What if someone wants
to submit a large number of 32-node jobs and is perfectly happy to wait
a (long) while?  We have QOSs with higher priorities, but with
restricted MaxWall, MaxJobs, MaxSubmit, MaxTRESPU, and users have to
request them explicitly.
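
To put rough numbers on the weighting approach, here is a
back-of-the-envelope sketch using the weights quoted above.  It is not
Slurm's actual calculation: I am assuming the job-size factor is simply
the fraction of the 350 nodes requested and that the age factor rises
linearly until PriorityMaxAge.  The fairshare, QOS and partition terms
are left out (no fairshare weight was quoted), and the function names
are made up for illustration.

```python
# Rough sketch only: the factor formulas below are simplified
# assumptions, not Slurm's exact implementation.

TOTAL_NODES = 350            # from the quoted cluster description
WEIGHT_AGE = 400_000         # PriorityWeightAge
WEIGHT_JOB_SIZE = 500_000    # PriorityWeightJobSize
MAX_AGE_DAYS = 7.0           # PriorityMaxAge=7-0


def age_factor(days_queued):
    """Assumed linear ramp from 0 to 1, saturating at PriorityMaxAge."""
    return min(days_queued / MAX_AGE_DAYS, 1.0)


def size_factor(nodes):
    """Assumed fraction of the cluster requested (PriorityFavorSmall=NO)."""
    return nodes / TOTAL_NODES


def partial_priority(nodes, days_queued):
    """Age and job-size terms only; fairshare, QOS, partition omitted."""
    return round(WEIGHT_AGE * age_factor(days_queued)
                 + WEIGHT_JOB_SIZE * size_factor(nodes))


if __name__ == "__main__":
    for nodes in (1, 8, 32):
        # Partial priority after 0, 1 and 7 days in the queue.
        print(nodes, [partial_priority(nodes, d) for d in (0, 1, 7)])
```

Under these assumptions the job-size term for a 32-node job is only
about 45,700, which is less than the roughly 57,100 that any job gains
from one day of age.  That would be consistent with the large jobs not
rising above the rest, and it suggests that PriorityWeightJobSize would
need to be raised substantially relative to PriorityWeightAge to make a
difference.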

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin         Email loris.bennett at fu-berlin.de


