[slurm-users] Very large job getting starved out
cproctor at tacc.utexas.edu
Thu Mar 21 14:19:30 UTC 2019
You might have a look at the thread "Large job starvation on cloud cluster" that started on Feb 27; there are some good tidbits in there. Off the top, without more information, I would venture that settings you have in slurm.conf end up backfilling the smaller jobs at the expense of scheduling the larger jobs.
Your partition configs plus accounting and scheduler configs from slurm.conf would be helpful.
Also, search for "job starvation" here: https://slurm.schedmd.com/sched_config.html as another potential starting point.
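As a concrete starting point, the backfill knobs that most often matter for large-job starvation live in SchedulerParameters. The fragment below is only an illustrative sketch (the values are placeholders, not a recommendation); the parameter names are documented on the sched_config page linked above:

```
# Illustrative slurm.conf backfill tuning -- placeholder values only.
SchedulerType=sched/backfill
# bf_window (minutes) should cover your longest time limit so the backfill
# scheduler can plan a future start for a large job instead of letting small
# jobs leapfrog it indefinitely; bf_max_job_user caps how many jobs per user
# are considered for backfill in each cycle.
SchedulerParameters=bf_window=4320,bf_max_job_user=20,bf_continue,default_queue_depth=500
```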
On 3/21/19 8:55 AM, David Baker wrote:
I understand that this is not a straightforward question, however I'm wondering if anyone has any useful ideas, please. Our cluster is busy and the QOS limits users to a maximum of 32 compute nodes on the "batch" queue. Users are making good use of the cluster -- for example, one user is running five 6-node jobs at the moment. On the other hand, a job belonging to another user has been stalled in the queue for around 7 days. He has made reasonable use of the cluster and as a result his fairshare component is relatively low. Having said that, the priority of his job is high -- it is currently one of the highest-priority jobs in the batch partition queue. From sprio...
  JOBID  PARTITION  PRIORITY     AGE  FAIRSHARE  JOBSIZE  PARTITION  QOS
 359323  batch        180292  100000      79646      547        100    0
I did think that the PriorityDecayHalfLife was quite high at 14 days and so I reduced that to 7 days. For reference I've included the key scheduling settings from the cluster below. Does anyone have any thoughts, please?
PriorityDecayHalfLife = 7-00:00:00
PriorityCalcPeriod = 00:05:00
PriorityFavorSmall = No
PriorityFlags = ACCRUE_ALWAYS,SMALL_RELATIVE_TO_TIME,FAIR_TREE
PriorityMaxAge = 7-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType = priority/multifactor
PriorityWeightAge = 100000
PriorityWeightFairShare = 1000000
PriorityWeightJobSize = 10000000
PriorityWeightPartition = 1000
PriorityWeightQOS = 10000
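For what it's worth, the priority/multifactor plugin just sums each normalized factor (0.0 to 1.0) multiplied by its weight. The sketch below uses the weights from the slurm.conf excerpt above; the normalized factor values are illustrative guesses chosen to roughly reproduce the sprio components for job 359323, not values read from the cluster:

```python
# Sketch of the priority/multifactor sum, using the PriorityWeight* values
# from the slurm.conf excerpt above. Factor values are illustrative only.
WEIGHTS = {
    "age": 100_000,
    "fairshare": 1_000_000,
    "jobsize": 10_000_000,
    "partition": 1_000,
    "qos": 10_000,
}

def job_priority(factors):
    """Each factor is normalized to [0, 1]; Slurm multiplies it by its
    configured weight and sums the resulting components."""
    return sum(round(WEIGHTS[name] * value) for name, value in factors.items())

# Illustrative normalized factors for the stalled job:
factors = {
    "age": 1.0,            # past PriorityMaxAge, so the age factor is capped at 1.0
    "fairshare": 0.079646,
    "jobsize": 0.0000547,  # SMALL_RELATIVE_TO_TIME shrinks the job-size factor
    "partition": 0.1,
    "qos": 0.0,
}
print(job_priority(factors))  # → 180293
```

Note the weight ratios: PriorityWeightJobSize is 100x PriorityWeightAge, yet with SMALL_RELATIVE_TO_TIME the effective job-size component here is tiny, so age and fairshare dominate.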