[slurm-users] Effect of PriorityMaxAge on job throughput

David Baker D.J.Baker at soton.ac.uk
Wed Apr 10 09:43:24 UTC 2019


Michael,

Thank you for your reply and your thoughts. These are the priority weights that I have configured in the slurm.conf.

PriorityWeightFairshare=1000000
PriorityWeightAge=100000
PriorityWeightPartition=1000
PriorityWeightJobSize=10000000
PriorityWeightQOS=10000
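
For anyone following along, my understanding is that with priority/multifactor these weights combine roughly as follows (each factor is normalised to the range 0.0-1.0; I've left out the association, TRES and nice terms):

  Job_priority = PriorityWeightAge       * age_factor
               + PriorityWeightFairshare * fairshare_factor
               + PriorityWeightJobSize   * job_size_factor
               + PriorityWeightPartition * partition_factor
               + PriorityWeightQOS       * qos_factor

so the weights above put job size and fair-share well ahead of age.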

I've made PWJobSize the highest factor; however, I understand that it only provides a one-off boost to a job and so is probably insignificant in the longer run. That's followed by PWFairshare.

Should I really be looking at increasing the PWAge factor to help "push" jobs through the system?
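
For what it's worth, sprio should show how much each factor is actually contributing to a pending job's priority, which ought to make it clear whether age or fair-share is dominating for the stuck jobs (the job id below is just a placeholder):

  sprio -l -j <jobid>    # long format: per-factor breakdown (age, fairshare, jobsize, partition, qos)
  sprio -w               # print the configured weights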

The other issue that might play a part is that we see a lot of single-node jobs (presumably backfilled) flowing into the system. Users aren't excessively bombing the cluster, but maybe some backfill throttling would be useful as well?
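
By "backfill throttling" I'm thinking of something along the lines of the options below -- the values here are purely illustrative, not a recommendation:

  SchedulerParameters=...,bf_max_job_user=20,bf_max_job_test=1000

i.e. capping how many jobs per user, and how many jobs in total, the backfill scheduler will consider in each cycle.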

What are your thoughts now that you've seen the priority factors, please? I've attached a copy of the slurm.conf in case you or anyone else wants to take a more complete look.

Best regards,
David

________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Michael Gutteridge <michael.gutteridge at gmail.com>
Sent: 09 April 2019 18:59
To: Slurm User Community List
Subject: Re: [slurm-users] Effect of PriorityMaxAge on job throughput


It might be useful to include the various priority factors you've got configured.  The fact that adjusting PriorityMaxAge had a dramatic effect suggests that the age factor is pretty high; it might be worth looking at that value relative to the other factors.

Have you looked at PriorityWeightJobSize?  It might have some utility if you're finding large jobs getting short shrift.

 - Michael


On Tue, Apr 9, 2019 at 2:01 AM David Baker <D.J.Baker at soton.ac.uk> wrote:
Hello,

I've finally got the job throughput/turnaround to be reasonable on our cluster. Most of the time the job activity on the cluster sets the default QOS to 32 nodes (there are 464 nodes in the default queue). Jobs requesting a number of nodes close to the QOS limit (for example 22 nodes) are now scheduled within 24 hours, which is better than it has been, although I suspect there is still room for improvement. I note that these large jobs still struggle to be given a start time; however, many jobs are now being given a start time following my SchedulerParameters makeover.

I used advice from the mailing list and the Slurm high throughput document to help me make changes to the scheduling parameters. They are now...

SchedulerParameters=assoc_limit_continue,batch_sched_delay=20,bf_continue,bf_interval=300,bf_min_age_reserve=10800,bf_window=3600,bf_resolution=600,bf_yield_interval=1000000,partition_job_depth=500,sched_max_job_start=200,sched_min_interval=2000000

Also..
PriorityFavorSmall=NO
PriorityFlags=SMALL_RELATIVE_TO_TIME,ACCRUE_ALWAYS,FAIR_TREE
PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0
PriorityMaxAge=1-0

The most significant change was actually reducing "PriorityMaxAge" from 7-0 to 1-0. Before that change the larger jobs could hang around in the queue for days. Does it make sense, therefore, to reduce PriorityMaxAge further, to less than 1 day? Your advice would be appreciated, please.
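
My understanding of the mechanics here (please correct me if I've got this wrong) is that the age factor grows linearly with the time a job has been eligible and saturates once that time reaches PriorityMaxAge, roughly:

  age_contribution = PriorityWeightAge * min(time_eligible / PriorityMaxAge, 1.0)

so with PriorityMaxAge=1-0 a job picks up its full age weighting after a single day in the queue, and reducing it further would just make jobs hit that ceiling sooner.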

Best regards,
David




