[slurm-users] Scheduler fails to consider all associations for job submitted to multiple partitions

Corey Keasling corey.keasling at jila.colorado.edu
Thu Jul 30 17:11:20 UTC 2020


Hi Matt,

No, these partitions do not overlap; no nodes are shared between them. 
But thanks for looking at this!

Corey
On 7/30/2020 10:29 AM, Matt Jay wrote:
> Corey,
> 
> Are the partitions over the same nodes by chance?  If so, you could be hitting this:
> https://bugs.schedmd.com/show_bug.cgi?id=3881
> 
> "This is by design. If there are _any_ jobs pending (regardless of the reason for the job still pending) in a partition with a higher Priority, no jobs from a lower Priority will be launched on nodes that are shared in common."
> 
> -Matt
> 
> -----Original Message-----
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Corey Keasling
> Sent: Monday, July 27, 2020 8:10 PM
> To: Slurm User Community List <slurm-users at lists.schedmd.com>
> Subject: [slurm-users] Scheduler fails to consider all associations for job submitted to multiple partitions
> 
> Hi Slurm Folks,
> 
> I've run into a problem with how Slurm schedules jobs submitted to multiple partitions.
> 
> I'm running Slurm 20.02.3.  Our cluster is divided into two partitions by node funding group.  All users have rights to, and submit to, both partitions (i.e., jobs specify -p part1,part2).  Each user's associations impose GrpTRESRunMins limits on both CPU and memory.
> Members of one of the funding groups receive a higher fairshare on their partition.
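> 
> For reference, those per-partition limits live on each user's two 
> associations and were set with sacctmgr along roughly these lines (the 
> values match the sacctmgr output further down; the exact commands we ran 
> may have differed):
> 
>     sacctmgr modify user where name=coke4948 partition=jila \
>         set GrpTRESRunMins=cpu=152640,mem=1328400G
>     sacctmgr modify user where name=coke4948 partition=nistq \
>         set GrpTRESRunMins=cpu=218880,mem=2577600G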
> 
> The goal is to be able to submit jobs to both partitions and have them run in whichever partition has space - both free resources and remaining TRESRunMins. Unfortunately, when a job would exceed the limit on one partition, it is prevented from running on the other partition as well.
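> 
> In practice every job is submitted with both partitions listed, i.e. 
> something like this (the script name is just a placeholder):
> 
>     sbatch -p jila,nistq -N 1 jns100m.sh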
> 
> If I submit a bunch of jobs to two partitions, 'nistq' and 'jila', squeue soon tells me this:
> 
> [coke4948 at terra new_config_2020-06]$ squeue -u coke4948
>                JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>               996052 jila,nist  jns100m coke4948 PD       0:00      1 (AssocGrpCPURunMinutesLimit)
>               996053 jila,nist  jns100m coke4948 PD       0:00      1 (AssocGrpCPURunMinutesLimit)
>               996054 jila,nist  jns100m coke4948 PD       0:00      1 (AssocGrpCPURunMinutesLimit)
>               996055 jila,nist  jns100m coke4948 PD       0:00      1 (AssocGrpCPURunMinutesLimit)
>               996056 jila,nist  jns100m coke4948 PD       0:00      1 (AssocGrpCPURunMinutesLimit)
>               996057 jila,nist  jns100m coke4948 PD       0:00      1 (AssocGrpCPURunMinutesLimit)
>               996058 jila,nist  jns100m coke4948 PD       0:00      1 (AssocGrpCPURunMinutesLimit)
>               996049      jila  jns100m coke4948  R       0:04      1 node42
>               996050      jila  jns100m coke4948  R       0:04      1 node42
>               996051      jila  jns100m coke4948  R       0:04      1 node43
> 
> Yet sshare says I'm not using anything on nistq:
> 
> [coke4948 at terra new_config_2020-06]$ sshare -A coke4948 -am -o User,part,tresrunmins
>         User    Partition                    TRESRunMins
> ---------- ------------ ------------------------------
>                           cpu=143884,mem=552516096,ener+
>     coke4948         jila cpu=143884,mem=552516096,ener+
>     coke4948        nistq cpu=0,mem=0,energy=0,node=0,b+
> 
> But I have room there:
> 
> sacctmgr: list assoc where user=coke4948 format=user,part,grptresrunmins%30
>         User  Partition                 GrpTRESRunMins
> ---------- ---------- ------------------------------
>     coke4948       jila        cpu=152640,mem=1328400G
>     coke4948      nistq        cpu=218880,mem=2577600G
> 
> Here's the output from slurmctld.log with debugging cranked up.  What's most interesting is that the scheduler holds the job because the limit on the jila partition gets hit, but then we see backfill test the job against the nistq partition and fail - but again, because the jila limit's been hit!
> 
> [2020-07-27T21:05:16.151] debug2: found 4 usable nodes from config containing node[40-43]
> [2020-07-27T21:05:16.151] debug2: found 12 usable nodes from config containing jnode[18-29]
> [2020-07-27T21:05:16.151] debug2: found 16 usable nodes from config containing jnode[01-16]
> [2020-07-27T21:05:16.151] debug2: found 1 usable nodes from config containing jnode17
> [2020-07-27T21:05:16.151] debug3: _pick_best_nodes: JobId=996058 idle_nodes 46 share_nodes 57
> [2020-07-27T21:05:16.151] debug2: select_p_job_test for JobId=996058
> [2020-07-27T21:05:16.151] debug2: JobId=996058 being held, if allowed the job request will exceed assoc 66(coke4948/coke4948/jila) group max running tres(cpu) minutes limit 152640 with already used 143524 + requested 48000
> [2020-07-27T21:05:16.151] debug3: sched: JobId=996058 delayed for accounting policy
> [2020-07-27T21:05:16.616] backfill test for JobId=996058 Prio=19542493 Partition=nistq
> [2020-07-27T21:05:16.616] debug2: backfill: entering _try_sched for JobId=996058.
> [2020-07-27T21:05:16.616] debug2: select_p_job_test for JobId=996058
> [2020-07-27T21:05:16.616] debug2: found 12 usable nodes from config containing node[44-55]
> [2020-07-27T21:05:16.616] debug2: found 9 usable nodes from config containing jnode[30-39]
> [2020-07-27T21:05:16.616] debug3: _pick_best_nodes: JobId=996058 idle_nodes 46 share_nodes 57
> [2020-07-27T21:05:16.616] debug2: select_p_job_test for JobId=996058
> [2020-07-27T21:05:16.616] debug2: JobId=996058 being held, if allowed the job request will exceed assoc 66(coke4948/coke4948/jila) group max running tres(cpu) minutes limit 152640 with already used 143524 + requested 48000
> [2020-07-27T21:05:16.616] debug3: backfill: Failed to start JobId=996058: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
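> 
> To spell out the numbers: 143524 cpu-minutes already in use plus the 
> 48000 requested comes to 191524, which is over jila's 152640 limit but 
> comfortably under nistq's 218880 - yet both the main scheduler and the 
> backfill pass check against the jila association even when the backfill 
> test says Partition=nistq.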
> 
> 
> I spent some time digging around in job_scheduler.c and acct_policy.c and, as near as I can tell, only one association is ever tried.  I assume this is because, while a job can list multiple partitions (per the job_record struct), it carries only one association (again, per that struct)?  Or maybe the job gets split into one job per partition somewhere, but the association doesn't get swapped out to match?
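> 
> To make the suspected behaviour concrete, here's a toy C sketch - not the 
> actual Slurm code; the struct and function names are made up - of a job 
> that lists several partitions but carries only a single assoc_ptr, so the 
> limit check comes out the same no matter which partition is being tested:
> 
>     /* Toy illustration only, NOT Slurm source. */
>     #include <stdio.h>
>     #include <string.h>
> 
>     struct assoc {               /* one association per (user, partition) */
>         const char *partition;
>         long grp_cpu_run_mins;   /* GrpTRESRunMins cpu limit              */
>         long used_cpu_run_mins;  /* cpu-minutes already committed         */
>     };
> 
>     struct job {
>         const char *part_list[2];/* a job may name several partitions ... */
>         struct assoc *assoc_ptr; /* ... but carries only ONE association  */
>         long req_cpu_run_mins;
>     };
> 
>     /* What the logs suggest happens today: the check always uses
>      * job->assoc_ptr, regardless of the partition under test.            */
>     static int fits_today(const struct job *j, const char *part)
>     {
>         (void) part;             /* partition being tested is ignored     */
>         return j->assoc_ptr->used_cpu_run_mins + j->req_cpu_run_mins
>                <= j->assoc_ptr->grp_cpu_run_mins;
>     }
> 
>     /* What I would expect: look up the association that matches the
>      * partition actually being scheduled.                                */
>     static int fits_expected(const struct job *j, const char *part,
>                              struct assoc *a, int n)
>     {
>         for (int i = 0; i < n; i++)
>             if (!strcmp(a[i].partition, part))
>                 return a[i].used_cpu_run_mins + j->req_cpu_run_mins
>                        <= a[i].grp_cpu_run_mins;
>         return 0;
>     }
> 
>     int main(void)
>     {
>         /* numbers from the sshare/sacctmgr/slurmctld output above        */
>         struct assoc assocs[] = { { "jila",  152640, 143524 },
>                                   { "nistq", 218880,      0 } };
>         struct job job = { { "jila", "nistq" }, &assocs[0], 48000 };
> 
>         for (int i = 0; i < 2; i++) {
>             const char *p = job.part_list[i];
>             printf("%-5s  today: %s  expected: %s\n", p,
>                    fits_today(&job, p)               ? "runs" : "held",
>                    fits_expected(&job, p, assocs, 2) ? "runs" : "held");
>         }
>         return 0;
>     }
> 
> If that's roughly what's going on, the fix would presumably be to 
> re-resolve the association per partition inside the main scheduling and 
> backfill loops rather than reusing job->assoc_ptr.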
> 
> I'm happy to dig around some more and have a go at patching this, if only because this is important functionality for us to have.
> 
> I should mention I'm running the Backfill scheduler with no scheduler options.  If I've overlooked something I'd love to learn what it is.
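> 
> Concretely, the relevant slurm.conf lines look roughly like this (nothing 
> set under SchedulerParameters):
> 
>     SchedulerType=sched/backfill
>     #SchedulerParameters=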
> 
> Thank you for your help!
> 
> Corey
> 
> --
> Corey Keasling
> Software Manager
> JILA Computing Group
> University of Colorado-Boulder
> 440 UCB Room S244
> Boulder, CO 80309-0440
> 303-492-9643
> 

