[slurm-users] Oddities with heterogeneous jobs
Kevin Buckley
Kevin.Buckley at pawsey.org.au
Mon Apr 19 05:52:53 UTC 2021
Slurm 20.02.5
We have a user who is submitting a job script containing
three heterogeneous srun invocations:
#SBATCH --nodes=15
#SBATCH --cpus-per-task=20 --ntasks=1 --ntasks-per-node=1
#SBATCH hetjob
#SBATCH --cpus-per-task=1 --ntasks=54 --ntasks-per-node=4
#SBATCH hetjob
#SBATCH --cpus-per-task=1 --ntasks=19 --ntasks-per-node=19
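For comparison, here is a hypothetical rewrite (untested on our system) that drops the het-job-wide --nodes=15 and instead gives each component its own node count, matching what Slurm itself reports for the three components:

```shell
#!/bin/bash
# Hypothetical rewrite: no het-job-wide --nodes line; each hetjob
# component states its own node requirement explicitly.
#SBATCH --nodes=1  --cpus-per-task=20 --ntasks=1  --ntasks-per-node=1
#SBATCH hetjob
#SBATCH --nodes=14 --cpus-per-task=1  --ntasks=54 --ntasks-per-node=4
#SBATCH hetjob
#SBATCH --nodes=1  --cpus-per-task=1  --ntasks=19 --ntasks-per-node=19
```

Whether that avoids the pending-with-Resources behaviour is exactly the question, of course.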
(And it'd be nice if the sbatch man page mentioned hetjob!)
Slurm does the "right thing" when creating the heterogeneous
job, in that it defines three hetjob components with:
NumNodes=1-1 NumCPUs=20 NumTasks=1
NumNodes=14 NumCPUs=54 NumTasks=54
NumNodes=1 NumCPUs=19 NumTasks=19
however at times where we can see 252 idle nodes, SOME of the
jobs start whilst SOME remain PENDING with Reason=Resources.
Initially I thought that the fact that the user was explicitly
requesting
#SBATCH --nodes=15
as well as the hetjob definitions might be falling foul of
some kind of totalling up of the 1+14+1 to give 16, but the
fact that some jobs do run suggests that's not the complete,
and/or possibly not the correct, answer.
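The arithmetic behind that suspicion can be sketched as follows (assuming each component's node count is ntasks divided by ntasks-per-node, rounded up, as Slurm's own NumNodes output suggests):

```python
import math

# Per-component task layout from the hetjob directives above
components = [
    {"ntasks": 1,  "ntasks_per_node": 1},   # 1 x 20-cpu task
    {"ntasks": 54, "ntasks_per_node": 4},   # 54 x 1-cpu tasks
    {"ntasks": 19, "ntasks_per_node": 19},  # 19 tasks on one node
]

# Implied node count per component: ceil(ntasks / ntasks-per-node)
nodes = [math.ceil(c["ntasks"] / c["ntasks_per_node"]) for c in components]
print(nodes)       # [1, 14, 1]
print(sum(nodes))  # 16 -- one more than the job-wide --nodes=15
```

If Slurm summed the components against the job-wide --nodes=15, the request could never be satisfied; since some jobs do start, that can't be the whole story.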
The examples on SchedMD's heterogeneous.html page don't show
any "het-job-wide" request for a number of nodes, which
suggests that Slurm works it out, but there's not much to go
on as regards a definitive answer.
Any thoughts/experiences out there?
Kevin
--
Supercomputing Systems Administrator
Pawsey Supercomputing Centre