[slurm-users] Scheduler fails to consider all associations for job submitted to multiple partitions
Corey Keasling
corey.keasling at jila.colorado.edu
Tue Jul 28 03:10:02 UTC 2020
Hi Slurm Folks,
I've run into a problem with how Slurm schedules jobs submitted to
multiple partitions.
I'm running Slurm 20.02.3. Our cluster is divided into two partitions
by node funding group. All users have rights to and submit to both
partitions (i.e., jobs specify -p part1,part2). Each user's
associations impose GrpTRESRunMins limits on both CPU and memory.
Members of one of the funding groups receive a higher fairshare on their
partition.
The goal is to be able to submit jobs to both partitions and have them
run in whichever partition has space - both resources and remaining
TRESRunMins. Unfortunately, when a job would exceed the limit on one
partition, it is prevented from running on the other partition.
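For concreteness, the jobs below are submitted along these lines (the
script name and per-job size are just placeholders - the relevant part
is the comma-separated partition list):

sbatch -p jila,nistq -N 1 -J jns100m job.sh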
If I submit a bunch of jobs to two partitions, 'nistq' and 'jila',
squeue soon tells me this:
[coke4948 at terra new_config_2020-06]$ squeue -u coke4948
  JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
 996052 jila,nist  jns100m coke4948 PD  0:00     1 (AssocGrpCPURunMinutesLimit)
 996053 jila,nist  jns100m coke4948 PD  0:00     1 (AssocGrpCPURunMinutesLimit)
 996054 jila,nist  jns100m coke4948 PD  0:00     1 (AssocGrpCPURunMinutesLimit)
 996055 jila,nist  jns100m coke4948 PD  0:00     1 (AssocGrpCPURunMinutesLimit)
 996056 jila,nist  jns100m coke4948 PD  0:00     1 (AssocGrpCPURunMinutesLimit)
 996057 jila,nist  jns100m coke4948 PD  0:00     1 (AssocGrpCPURunMinutesLimit)
 996058 jila,nist  jns100m coke4948 PD  0:00     1 (AssocGrpCPURunMinutesLimit)
 996049      jila  jns100m coke4948  R  0:04     1 node42
 996050      jila  jns100m coke4948  R  0:04     1 node42
 996051      jila  jns100m coke4948  R  0:04     1 node43
Yet sshare says I'm not using anything on nistq:
[coke4948 at terra new_config_2020-06]$ sshare -A coke4948 -am -o User,part,tresrunmins
      User    Partition                    TRESRunMins
---------- ------------ ------------------------------
                         cpu=143884,mem=552516096,ener+
  coke4948         jila  cpu=143884,mem=552516096,ener+
  coke4948        nistq  cpu=0,mem=0,energy=0,node=0,b+
But I have room there:
sacctmgr: list assoc where user=coke4948 format=user,part,grptresrunmins%30
User Partition GrpTRESRunMins
---------- ---------- ------------------------------
coke4948 jila cpu=152640,mem=1328400G
coke4948 nistq cpu=218880,mem=2577600G
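For reference, limits like these are set on the user/partition
associations with sacctmgr, roughly as follows (treat the exact syntax
as approximate; the values are the ones listed above):

sacctmgr modify user where name=coke4948 partition=jila set GrpTRESRunMins=cpu=152640,mem=1328400G
sacctmgr modify user where name=coke4948 partition=nistq set GrpTRESRunMins=cpu=218880,mem=2577600G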
Here's the output from slurmctld.log with debugging cranked up. What's
most interesting is that the scheduler holds the job because the limit
on the jila partition gets hit, but then we see backfill test the job
against the nistq partition and fail - again because the jila limit
has been hit!
[2020-07-27T21:05:16.151] debug2: found 4 usable nodes from config containing node[40-43]
[2020-07-27T21:05:16.151] debug2: found 12 usable nodes from config containing jnode[18-29]
[2020-07-27T21:05:16.151] debug2: found 16 usable nodes from config containing jnode[01-16]
[2020-07-27T21:05:16.151] debug2: found 1 usable nodes from config containing jnode17
[2020-07-27T21:05:16.151] debug3: _pick_best_nodes: JobId=996058 idle_nodes 46 share_nodes 57
[2020-07-27T21:05:16.151] debug2: select_p_job_test for JobId=996058
[2020-07-27T21:05:16.151] debug2: JobId=996058 being held, if allowed the job request will exceed assoc 66(coke4948/coke4948/jila) group max running tres(cpu) minutes limit 152640 with already used 143524 + requested 48000
[2020-07-27T21:05:16.151] debug3: sched: JobId=996058 delayed for accounting policy
[2020-07-27T21:05:16.616] backfill test for JobId=996058 Prio=19542493 Partition=nistq
[2020-07-27T21:05:16.616] debug2: backfill: entering _try_sched for JobId=996058.
[2020-07-27T21:05:16.616] debug2: select_p_job_test for JobId=996058
[2020-07-27T21:05:16.616] debug2: found 12 usable nodes from config containing node[44-55]
[2020-07-27T21:05:16.616] debug2: found 9 usable nodes from config containing jnode[30-39]
[2020-07-27T21:05:16.616] debug3: _pick_best_nodes: JobId=996058 idle_nodes 46 share_nodes 57
[2020-07-27T21:05:16.616] debug2: select_p_job_test for JobId=996058
[2020-07-27T21:05:16.616] debug2: JobId=996058 being held, if allowed the job request will exceed assoc 66(coke4948/coke4948/jila) group max running tres(cpu) minutes limit 152640 with already used 143524 + requested 48000
[2020-07-27T21:05:16.616] debug3: backfill: Failed to start JobId=996058: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
I spent some time digging around in job_scheduler.c and acct_policy.c
and, as near as I can tell, only one association is ever tried. I'm
assuming this is because, though the job can have multiple partitions
(per the job_record struct), it has only one association (again, per that
struct)? Or maybe it gets split into one job per partition somewhere,
but the association doesn't get similarly replaced?
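That reading fits the log above: both the main scheduler and the
backfill pass cite assoc 66 - the jila association - even while the
backfill test is explicitly against the nistq partition. Each
user/partition pair has its own association ID, which can be listed
with something like:

sacctmgr show assoc where user=coke4948 format=id,user,partition

yet only the jila association's ID (66) appears in the messages above.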
I'm happy to dig around some more and have a go at patching this, if
only because this is important functionality for us to have.
I should mention I'm running the Backfill scheduler with no scheduler
options. If I've overlooked something I'd love to learn what it is.
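For completeness, the scheduler selection in slurm.conf amounts to just
this (everything else omitted):

SchedulerType=sched/backfill
# SchedulerParameters is not set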
Thank you for your help!
Corey
--
Corey Keasling
Software Manager
JILA Computing Group
University of Colorado-Boulder
440 UCB Room S244
Boulder, CO 80309-0440
303-492-9643