[slurm-users] Very large job getting starved out

David Baker D.J.Baker at soton.ac.uk
Thu Mar 21 13:55:29 UTC 2019


Hello,


I understand that this is not a straightforward question, but I'm wondering if anyone has any useful ideas, please. Our cluster is busy and the QOS limits users to a maximum of 32 compute nodes on the "batch" queue. Users are making good use of the cluster -- for example, one user is currently running five 6-node jobs. On the other hand, a job belonging to another user has been stalled in the queue for around 7 days. He has made reasonable use of the cluster, and as a result his fairshare component is relatively low. Having said that, the priority of his job is high -- it is currently one of the highest-priority jobs in the batch partition queue. From sprio...


 JOBID  PARTITION  PRIORITY     AGE  FAIRSHARE  JOBSIZE  PARTITION  QOS
359323  batch        180292  100000      79646      547        100    0
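

(As a sanity check on my reading of sprio: the reported priority appears to be just the sum of the weighted components, 100000 + 79646 + 547 + 100 + 0 = 180293, which matches the PRIORITY column to within rounding. If that's right, the age component has already reached PriorityWeightAge, i.e. the job is at PriorityMaxAge and won't gain any further priority from waiting.)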


I did think that the PriorityDecayHalfLife was quite high at 14 days, so I have reduced it to 7 days. For reference, I've included the key scheduling settings from the cluster below. Does anyone have any thoughts, please?


Best regards,

David


PriorityDecayHalfLife   = 7-00:00:00
PriorityCalcPeriod      = 00:05:00
PriorityFavorSmall      = No
PriorityFlags           = ACCRUE_ALWAYS,SMALL_RELATIVE_TO_TIME,FAIR_TREE
PriorityMaxAge          = 7-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType            = priority/multifactor
PriorityWeightAge       = 100000
PriorityWeightFairShare = 1000000
PriorityWeightJobSize   = 10000000
PriorityWeightPartition = 1000
PriorityWeightQOS       = 10000
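

In case it helps to see how I understand these weights to combine, here is a rough sketch (in Python, not Slurm's actual code) of the multifactor sum; the factor values in the example are hypothetical, back-calculated from the sprio components above:

    # Rough sketch of the multifactor priority sum: each factor is a value in
    # [0.0, 1.0] scaled by the corresponding PriorityWeight* from slurm.conf.
    WEIGHTS = {
        "age":       100000,
        "fairshare": 1000000,
        "jobsize":   10000000,
        "partition": 1000,
        "qos":       10000,
    }

    def job_priority(factors):
        """factors: dict mapping factor name -> value in [0.0, 1.0]."""
        return round(sum(WEIGHTS[name] * value for name, value in factors.items()))

    # Hypothetical factors roughly matching job 359323 above.
    print(job_priority({"age": 1.0, "fairshare": 0.079646,
                        "jobsize": 0.0000547, "partition": 0.1, "qos": 0.0}))
    # -> 180293 (the sprio PRIORITY of 180292 differs only by rounding)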




