[slurm-users] backfill on overlapping partitions problem
mattjay at uw.edu
Tue Oct 26 17:00:02 UTC 2021
Take a look at this, and see if it matches up with your issue (I'm not 100% sure based on your description):
The takeaway from that is the following (quote from SchedMD): " If there are _any_ jobs pending (regardless of the reason for the job still pending) in a partition with a higher Priority, no jobs from a lower Priority will be launched on nodes that are shared in common."
The above is apparently pretty intrinsic to how Slurm scheduling works, and is unlikely to change.
We worked around this by keeping all partitions at the same priority and using QOS instead for priority/preemption -- that has the unfortunate side effect of tying up your QOSes for that purpose, but it works for our situation.
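For reference, a minimal sketch of that arrangement, reusing the partition names from this thread (the node ranges, QOS names, and priority values are placeholders, not our actual config):

```
# slurm.conf -- both partitions share the same PriorityTier,
# so pending jobs in one no longer block backfill on the other
PreemptType=preempt/qos
PreemptMode=REQUEUE
PartitionName=cpu      Nodes=node[001-100] PriorityTier=1 Default=YES
PartitionName=largemem Nodes=node[001-020] PriorityTier=1

# The priority/preemption relationship moves into QOS (via sacctmgr):
#   sacctmgr add qos highmem set Priority=1000 Preempt=normal
#   sacctmgr modify qos normal set Priority=100
# High-priority work is then submitted with something like:
#   sbatch -p largemem --qos=highmem job.sh
```

The key point is that with equal PriorityTier values, the backfill scheduler no longer treats one partition's pending queue as a reason to hold nodes shared with the other.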
Best of luck,
Sr. HPC Systems Engineer - Hyak
University of Washington Information Technology
From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Andrej Filipcic
Sent: Tuesday, October 26, 2021 7:42 AM
To: slurm-users at lists.schedmd.com
Subject: [slurm-users] backfill on overlapping partitions problem
We have a strange problem with backfilling: there is a large partition "cpu" and an overlapping partition "largemem", whose nodes are a subset of the "cpu" nodes.
Now, user A submits low-priority jobs to "cpu", and user B submits high-priority jobs to "largemem".
If there are queued jobs in "largemem" (draining nodes there), slurmctld never backfills "cpu". In the extreme case, the non-overlapping "cpu" nodes sit empty until all the higher-priority jobs in "largemem" are running.
Any hint or workaround here? Backfill works fine if all jobs are submitted to the "cpu" partition. User A typically has smaller and shorter jobs, which are good for backfilling.
We use these settings with Slurm:
prof. dr. Andrej Filipcic, E-mail: Andrej.Filipcic at ijs.si
Department of Experimental High Energy Physics - F9
Jozef Stefan Institute, Jamova 39, P.o.Box 3000
SI-1001 Ljubljana, Slovenia
Tel.: +386-1-477-3674 Fax: +386-1-425-7074