[slurm-users] Very large job getting starved out
David Baker
D.J.Baker at soton.ac.uk
Thu Mar 21 13:55:29 UTC 2019
Hello,
I understand that this is not a straightforward question; however, I'm wondering if anyone has any useful ideas, please. Our cluster is busy and the QOS limits users to a maximum of 32 compute nodes on the "batch" queue. Users are making good use of the cluster -- for example, one user is running five 6-node jobs at the moment. On the other hand, a job belonging to another user has been stalled in the queue for around 7 days. He has made reasonable use of the cluster and, as a result, his fairshare component is relatively low. Having said that, the priority of his job is high -- it is currently one of the highest priority jobs in the batch partition queue. From sprio...
  JOBID  PARTITION  PRIORITY     AGE  FAIRSHARE  JOBSIZE  PARTITION  QOS
 359323  batch        180292  100000      79646      547        100    0
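(If anyone wants to look at the same view on their own system, the breakdown above should be reproducible with something like the following -- if I remember the flags correctly, sprio's -w option prints the configured weights as well:

    sprio -l -j 359323    # long listing of the priority factors for this job
    sprio -w              # print the configured priority weights

)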
I did think that the PriorityDecayHalfLife was quite high at 14 days, so I have reduced it to 7 days. For reference, I've included the key scheduling settings from the cluster below. Does anyone have any thoughts, please?
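For completeness, a change like that is just the one line in slurm.conf, picked up with a reconfigure -- roughly as follows (a sketch, assuming slurm.conf is edited in place rather than generated by configuration management):

    # slurm.conf
    PriorityDecayHalfLife=7-00:00:00

    scontrol reconfigure    # re-read slurm.conf so the new value takes effect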
Best regards,
David
PriorityDecayHalfLife = 7-00:00:00
PriorityCalcPeriod = 00:05:00
PriorityFavorSmall = No
PriorityFlags = ACCRUE_ALWAYS,SMALL_RELATIVE_TO_TIME,FAIR_TREE
PriorityMaxAge = 7-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType = priority/multifactor
PriorityWeightAge = 100000
PriorityWeightFairShare = 1000000
PriorityWeightJobSize = 10000000
PriorityWeightPartition = 1000
PriorityWeightQOS = 10000
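As a sanity check on those numbers: as far as I understand the multifactor plugin, each sprio column is just the configured weight multiplied by a normalised factor between 0 and 1, and the columns sum (to within rounding) to the job priority. For job 359323 that works out roughly as:

    AGE        100000  = 100000   * 1.0       (pending for ~PriorityMaxAge already)
    FAIRSHARE   79646  = 1000000  * ~0.0796
    JOBSIZE       547  = 10000000 * ~0.00005  (SMALL_RELATIVE_TO_TIME in effect)
    PARTITION     100  = 1000     * 0.1
    QOS             0  = 10000    * 0.0
    ------------------
    PRIORITY   180293  (sprio reports 180292)

so the age factor is already maxed out, and fairshare is contributing well under a tenth of its possible 1000000.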