Hi everyone!
I have now read the Slurm documentation on QOS, resource limits, scheduling and priority multiple times and even looked into the Slurm source, but I'm still not sure I got everything right, which is why I decided to ask here ...
The problem: larger jobs (e.g. 16 GPUs) in our (small) GPU queue sometimes get delayed and pushed back without any reason that is apparent to us, while small jobs that only use 1 or 2 GPUs get scheduled much more quickly even though they have a runtime of 3 days ...
What we want to do:
- We have a number of nodes with 2 GPUs each that are usable by the users of our cluster.
- Some of these nodes belong to so-called 'private projects'. Private projects have a higher priority than other projects. Attached to each private project is a contingent of nodes and a number of "guaranteed" nodes, e.g. a contingent of 4 nodes (8 GPUs) and 2 "guaranteed" nodes (4 GPUs); see the small example after this list.
- Guaranteed nodes are nodes that should always be kept idle for the private project, so users of the private project can immediately schedule work on them.
- The other nodes are in general shared with the other projects whenever they are not in use.
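To put concrete numbers on that example (every node has 2 GPUs):

  contingent: 4 nodes = 8 GPUs  -> upper bound on what the private project may use at any one time
  guaranteed: 2 nodes = 4 GPUs  -> always kept idle for the project so its jobs can start immediately
  everything that is not guaranteed is shared with the other projects while it is idle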
How we are currently doing this (it has history):
Let's assume we have 50 nodes and 100 GPUs.
- We have a single partition for all GPU nodes (i.e. the 50 nodes).
- Private projects have private queues with a very high priority and a GRES limit equal to the number of GPUs they reserved (e.g. 10 nodes -> 20 GPUs).
- Normal projects only have access to the public queue and schedule their work there.
- This public queue has an upper GRES limit of "total number of GPUs" minus "guaranteed GPUs of all private projects" (e.g. 50 - 10 nodes -> 40 nodes -> 80 GPUs); a stripped-down sketch of this setup follows below.
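Stripped down to placeholders, the relevant part of the setup looks roughly like this (assuming the private/public "queues" above are QOSes, which the QOS group-TRES log message further down suggests; QOS names, node names and priority values are invented for this example, and there is one such private QOS per private project):

  # slurm.conf: a single partition over all GPU nodes
  PartitionName=gpu Nodes=gpunode[01-50] State=UP

  # per private project: very high priority, capped at the GPUs it reserved
  sacctmgr add qos priv_projA
  sacctmgr modify qos priv_projA set Priority=1000 GrpTRES=gres/gpu=20

  # public QOS: capped at total GPUs minus the guaranteed GPUs of all private projects (100 - 20 = 80)
  sacctmgr add qos public_gpu
  sacctmgr modify qos public_gpu set Priority=100 GrpTRES=gres/gpu=80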
Regarding the scheduler, we currently use the following settings:
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_CORE_MEMORY,CR_CORE_DEFAULT_DIST_BLOCK,CR_ONE_TASK_PER_CORE
SchedulerParameters=defer,max_sched_time=4,default_queue_depth=1000,partition_job_depth=500,enable_user_top,bf_max_job_user=20,bf_interval=120,bf_window=4320,bf_resolution=1800,bf_continue
Partition/queue depth is deliberately set high at the moment to avoid problems with jobs not even being examined.
The problem in more detail:
One of the last jobs we diagnosed (16 GPUs requested) had an approximate start time beyond all end times of the running/scheduled jobs: jobs ending on Feb 22 would have released more than enough GPUs for it to start right afterwards, but the estimated start time was still Feb 23. Priority-wise it was the highest-priority pending job in the partition.
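By "approximate start time" I mean what Slurm itself reports for the pending job, i.e. roughly (job ID is a placeholder):

  # estimated start time of the pending job
  squeue --start -j <jobid>
  # full job record, including Priority, Reason and StartTime
  scontrol show job <jobid>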
When we turned on scheduler debugging and increased log levels, we observed the following messages for this job:
JobId=xxxx being held, if allowed the job request will exceed QOS xxxxx group max tres(gres/gpu) limit yy with already used yy + requested 16
followed by
sched: JobId=2796696 delayed for accounting policy
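The configured QOS limits and the usage currently counted against them can be checked with something like:

  # configured QOS limits (GrpTRES is where the gres/gpu cap lives)
  sacctmgr show qos format=Name,Priority,GrpTRES
  # current usage counted against each QOS
  scontrol show assoc_mgr flags=qos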
These messages told us that the scheduler was always hitting the QOS limits (which makes sense, because usage in the GPU queue is always very high) and that this was why the job wasn't being scheduled ...
At first we were worried that this meant that "held"/"delayed" jobs like this would never actually get scheduled when contention is high enough, e.g. when small jobs keep getting backfilled in and the QOS limits therefore stay at their maximum for a long time.
But, for a reason we could not determine, the job eventually did get scheduled and then ran at the scheduled start time.
Open Questions:
- Why could the job not be scheduled in the first place? Initially we thought (from the source code I looked into) that "delayed for accounting policy" prevents any further scheduling of the job, but since it was eventually scheduled, this assumption must be wrong?
- Why was it scheduled at some point? When it was scheduled, contention was still high and the QOS limits definitely still applied.
- How could we modify the current setup so that the scheduling of larger jobs becomes "better" and more reproducible/explainable?
Apart from all of this, I'm also wondering whether there is maybe a better way to set up a system that works the way we want.
This got a bit long, but I hope it's clear enough :)
Kind regards, Katrin