[slurm-users] Scheduler fails to consider all associations for job submitted to multiple partitions

Corey Keasling corey.keasling at jila.colorado.edu
Tue Jul 28 03:10:02 UTC 2020


Hi Slurm Folks,

I've run into a problem with how Slurm schedules jobs submitted to 
multiple partitions.

I'm running Slurm 20.02.3.  Our cluster is divided into two partitions 
by node funding group.  All users have access to, and submit to, both 
partitions (i.e., jobs specify -p part1,part2).  Each user's 
associations impose GrpTRESRunMins limits on both CPU and memory. 
Members of one of the funding groups receive a higher fairshare on 
their partition.
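
For reference, the per-partition limits are set on the user 
associations with sacctmgr, along these lines (this is a reconstruction 
for illustration, not a transcript; the actual values appear in the 
sacctmgr output further down):

sacctmgr modify user where name=coke4948 partition=jila \
    set GrpTRESRunMins=cpu=152640,mem=1328400G
sacctmgr modify user where name=coke4948 partition=nistq \
    set GrpTRESRunMins=cpu=218880,mem=2577600G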

The goal is to be able to submit jobs to both partitions and have them 
run in whichever partition has space - both free resources and 
remaining TRESRunMins budget. Unfortunately, when a job would exceed 
the limit on one partition, it is also prevented from running on the 
other partition.
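
Concretely, a submission looks like this (the script name here is just 
illustrative; the important part is the comma-separated partition 
list):

sbatch --partition=nistq,jila jns100m.sh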

If I submit a bunch of jobs to two partitions, 'nistq' and 'jila', 
squeue soon tells me this:

[coke4948 at terra new_config_2020-06]$ squeue -u coke4948
              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             996052 jila,nist  jns100m coke4948 PD       0:00      1 (AssocGrpCPURunMinutesLimit)
             996053 jila,nist  jns100m coke4948 PD       0:00      1 (AssocGrpCPURunMinutesLimit)
             996054 jila,nist  jns100m coke4948 PD       0:00      1 (AssocGrpCPURunMinutesLimit)
             996055 jila,nist  jns100m coke4948 PD       0:00      1 (AssocGrpCPURunMinutesLimit)
             996056 jila,nist  jns100m coke4948 PD       0:00      1 (AssocGrpCPURunMinutesLimit)
             996057 jila,nist  jns100m coke4948 PD       0:00      1 (AssocGrpCPURunMinutesLimit)
             996058 jila,nist  jns100m coke4948 PD       0:00      1 (AssocGrpCPURunMinutesLimit)
             996049      jila  jns100m coke4948  R       0:04      1 node42
             996050      jila  jns100m coke4948  R       0:04      1 node42
             996051      jila  jns100m coke4948  R       0:04      1 node43

Yet sshare says I'm not using anything on nistq:

[coke4948 at terra new_config_2020-06]$ sshare -A coke4948 -am -o User,part,tresrunmins
       User    Partition                    TRESRunMins
---------- ------------ ------------------------------
                         cpu=143884,mem=552516096,ener+
   coke4948         jila cpu=143884,mem=552516096,ener+
   coke4948        nistq cpu=0,mem=0,energy=0,node=0,b+

But I have room there:

sacctmgr: list assoc where user=coke4948 format=user,part,grptresrunmins%30
       User  Partition                 GrpTRESRunMins
---------- ---------- ------------------------------
   coke4948       jila        cpu=152640,mem=1328400G
   coke4948      nistq        cpu=218880,mem=2577600G

Here's the output from slurmctld.log with debugging cranked up.  What's 
most interesting is that the scheduler holds the job because the limit 
on the jila partition gets hit, but then we see backfill test the job 
against the nistq partition and fail - but again, because the jila 
limit's been hit!

[2020-07-27T21:05:16.151] debug2: found 4 usable nodes from config containing node[40-43]
[2020-07-27T21:05:16.151] debug2: found 12 usable nodes from config containing jnode[18-29]
[2020-07-27T21:05:16.151] debug2: found 16 usable nodes from config containing jnode[01-16]
[2020-07-27T21:05:16.151] debug2: found 1 usable nodes from config containing jnode17
[2020-07-27T21:05:16.151] debug3: _pick_best_nodes: JobId=996058 idle_nodes 46 share_nodes 57
[2020-07-27T21:05:16.151] debug2: select_p_job_test for JobId=996058
[2020-07-27T21:05:16.151] debug2: JobId=996058 being held, if allowed the job request will exceed assoc 66(coke4948/coke4948/jila) group max running tres(cpu) minutes limit 152640 with already used 143524 + requested 48000
[2020-07-27T21:05:16.151] debug3: sched: JobId=996058 delayed for accounting policy
[2020-07-27T21:05:16.616] backfill test for JobId=996058 Prio=19542493 Partition=nistq
[2020-07-27T21:05:16.616] debug2: backfill: entering _try_sched for JobId=996058.
[2020-07-27T21:05:16.616] debug2: select_p_job_test for JobId=996058
[2020-07-27T21:05:16.616] debug2: found 12 usable nodes from config containing node[44-55]
[2020-07-27T21:05:16.616] debug2: found 9 usable nodes from config containing jnode[30-39]
[2020-07-27T21:05:16.616] debug3: _pick_best_nodes: JobId=996058 idle_nodes 46 share_nodes 57
[2020-07-27T21:05:16.616] debug2: select_p_job_test for JobId=996058
[2020-07-27T21:05:16.616] debug2: JobId=996058 being held, if allowed the job request will exceed assoc 66(coke4948/coke4948/jila) group max running tres(cpu) minutes limit 152640 with already used 143524 + requested 48000
[2020-07-27T21:05:16.616] debug3: backfill: Failed to start JobId=996058: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)


I spent some time digging around in job_scheduler.c and acct_policy.c 
and, as near as I can tell, only one association is ever tried.  I'm 
assuming this is because, though the job can have multiple partitions 
(per the job_record struct), it has only one association (again, per 
that struct)?  Or maybe the job gets split into one job per partition 
somewhere, but the association doesn't get swapped out to match?
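
(A quick user-space check that's at least consistent with this: the 
pending job carries its full partition list but only a single account, 
e.g.

    scontrol show job 996058 | grep -E 'Partition=|Account='

so whatever per-partition association lookup happens has to happen 
inside the scheduler itself.)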

I'm happy to dig around some more and have a go at patching this, if 
only because this is important functionality for us to have.

I should mention I'm running the backfill scheduler with no scheduler 
options set.  If I've overlooked something, I'd love to learn what it is.
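
For completeness, the scheduler configuration in slurm.conf is 
essentially just:

SchedulerType=sched/backfill
# AccountingStorageEnforce must include "limits" (or "safe"/"all"),
# since the GrpTRESRunMins caps are clearly being enforced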

Thank you for your help!

Corey

-- 
Corey Keasling
Software Manager
JILA Computing Group
University of Colorado-Boulder
440 UCB Room S244
Boulder, CO 80309-0440
303-492-9643


