[slurm-users] Oddities with heterogeneous jobs

Kevin Buckley Kevin.Buckley at pawsey.org.au
Mon Apr 19 05:52:53 UTC 2021


Slurm 20.02.5

We have a user who is submitting a job script containing
three heterogeneous srun invocations, requested via

#SBATCH --nodes=15

#SBATCH --cpus-per-task=20 --ntasks=1 --ntasks-per-node=1
#SBATCH hetjob
#SBATCH --cpus-per-task=1 --ntasks=54 --ntasks-per-node=4
#SBATCH hetjob
#SBATCH --cpus-per-task=1 --ntasks=19 --ntasks-per-node=19

(And it'd be nice if the sbatch man page mentioned hetjob!)

Slurm does the "right thing" when creating the heterogeneous
jobs, in that it defines three hetjobs with

NumNodes=1-1 NumCPUs=20 NumTasks=1

NumNodes=14 NumCPUs=54 NumTasks=54

NumNodes=1 NumCPUs=19 NumTasks=19

however, at times when we can see 252 idle nodes, SOME of the
jobs start whilst SOME remain PENDING with Reason=Resources.

Initially we thought that the fact that the user was explicitly
requesting

#SBATCH --nodes=15

as well as the hetjob definitions, might be falling foul of
some kind of totalling up of the 1+14+1 to give 16, but the
fact that some jobs do run suggests that's not the complete,
and/or possibly not the correct, answer.

The examples on SchedMD's heterogeneous.html page don't show
any "het-job-wide" request for a number of nodes, suggesting
that Slurm works it out, but there's not that much to go on
as regards a definitive answer.
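For what it's worth, one thing we're thinking of trying (just a
sketch, and assuming the user really does want 1+14+1 = 16 nodes
rather than 15 overall) is to drop the het-job-wide --nodes and
give each component its own --nodes, e.g.

#SBATCH --nodes=1  --cpus-per-task=20 --ntasks=1  --ntasks-per-node=1
#SBATCH hetjob
#SBATCH --nodes=14 --cpus-per-task=1  --ntasks=54 --ntasks-per-node=4
#SBATCH hetjob
#SBATCH --nodes=1  --cpus-per-task=1  --ntasks=19 --ntasks-per-node=19

so that there's no overall node count for Slurm to reconcile
against the per-component requests, but we haven't confirmed
that this changes the scheduling behaviour.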

Any thoughts/experiences out there?

Kevin
-- 
Supercomputing Systems Administrator
Pawsey Supercomputing Centre



More information about the slurm-users mailing list