[slurm-users] job not running if partition MaxCPUsPerNode < actual max

Diego Zuccato diego.zuccato at unibo.it
Tue Oct 3 08:17:01 UTC 2023


I was recently bitten by
EnforcePartLimits=ALL
which rejects any job that cannot run in *all* of the given partitions. I solved 
it by setting
EnforcePartLimits=ANY
so that a job is queued as long as it can run in *any* of the given partitions 
(very useful if you also use JobSubmitPlugins=all_partitions ).
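For reference, a minimal slurm.conf sketch of the setup I mean (values are illustrative, not from any real cluster):

```
# slurm.conf (fragment)
# ALL: reject at submission time unless the job fits every listed partition
# ANY: queue the job if it fits at least one listed partition
EnforcePartLimits=ANY
# optional: have submissions consider every partition automatically
JobSubmitPlugins=all_partitions
```

With ANY, a multi-partition submission is accepted even if one of the partitions (e.g. one with a tighter MaxCPUsPerNode) cannot satisfy the request.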

Diego


Il 15/08/2023 18:43, Bernstein, Noam CIV USN NRL (6393) Washington DC 
(USA) ha scritto:
> We have a heterogeneous mix of nodes, most 32 core, but one group of 36 
> core, grouped into homogeneous partitions.  We'd like to be able to 
> specify multiple partitions so that a job can run on any homogeneous 
> group.  It would be nice if we could run on all such nodes using 32 
> cores per node.  To try to do this, I created a partition for the 
> 36-core nodes (call them n2019) which specifies a max CPU count of 64
> 
>     PartitionName=n2019            DefMemPerCPU=2631 Nodes=compute-4-[0-47]
>     PartitionName=n2019_32         DefMemPerCPU=2631 Nodes=compute-4-[0-47] MaxCPUsPerNode=64
>     PartitionName=n2021            DefMemPerCPU=2960 Nodes=compute-7-[0-18]
> 
> 
> However, if I try to run a 128-task, 1-task-per-core job on n2019_32, 
> the sbatch fails with
> 
>      > sbatch --ntasks=128 --exclusive --partition=n2019_32 --ntasks-per-core=1 job.pbs
> 
>     sbatch: error: Batch job submission failed: Requested node
>     configuration is not available
> 
> (please ignore the ".pbs" - it's a relic, and the job script works with 
> Slurm). The identical command works with "n2019" or "n2021" as the 
> partition (but the former uses 36 cores per node). If I specify 
> multiple partitions, the job only actually runs when nodes outside 
> n2019 (which shares its node set with n2019_32) are available.
> 
> The job header includes only walltime, job name and stdout/stderr files, 
> shell, and a job array range.
> 
> I tried adding "-v" to the sbatch command to see if that gives more 
> useful info, but I couldn't get any more insight.  Does anyone have any 
> idea why it's rejecting my job?
> 
> thanks,
> Noam

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786


