Dear Community,
I'm seeing strange behavior from sbatch with different --nodelist (-w) options on my two-node cluster.
Here are my test scripts:
~/slurm$ cat mpirun.slm
#!/bin/bash
#SBATCH --job-name=mpirun_2x1
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive
source /usr/mpi/gcc/openmpi-4.1.7a1/bin/mpivars.sh
mpirun ./a.sh
~/slurm$ cat a.sh
#!/bin/bash
echo "`uname -n` OMPI_COMM_WORLD_RANK = $OMPI_COMM_WORLD_RANK SLURM_NODEID = $SLURM_NODEID SLURM_JOB_NODELIST = $SLURM_JOB_NODELIST"
If I do not specify a -w option, or if I include both nodes in the -w option (both forms shown below), I get the expected result:
~/slurm$ sbatch mpirun.slm
Submitted batch job 71
~/slurm$ cat slurm-71.out
std-199 OMPI_COMM_WORLD_RANK = 0 SLURM_NODEID = 0 SLURM_JOB_NODELIST = std-[199,271]
std-271 OMPI_COMM_WORLD_RANK = 1 SLURM_NODEID = 0 SLURM_JOB_NODELIST = std-[199,271]
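For reference, the both-nodes form mentioned above is simply the following (output not shown; it matches job 71):
~/slurm$ sbatch -w std-199,std-271 mpirun.slm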
However, if I specify only one node in the -w option while still requesting two nodes, one of the two possible submissions gives the expected result and the other does not: the unexpected one dispatches both MPI tasks to the same node.
This one is as expected: the two MPI tasks run on different nodes.
~/slurm$ sbatch -w std-199 mpirun.slm
Submitted batch job 72
~/slurm$ cat slurm-72.out
std-199 OMPI_COMM_WORLD_RANK = 0 SLURM_NODEID = 0 SLURM_JOB_NODELIST = std-[199,271]
std-271 OMPI_COMM_WORLD_RANK = 1 SLURM_NODEID = 0 SLURM_JOB_NODELIST = std-[199,271]
This one is unexpected: both MPI tasks end up on the same node, even though SLURM_JOB_NODELIST still shows the correct two nodes.
~/slurm$ sbatch -w std-271 mpirun.slm
Submitted batch job 73
~/slurm$ cat slurm-73.out
std-199 OMPI_COMM_WORLD_RANK = 0 SLURM_NODEID = 0 SLURM_JOB_NODELIST = std-[199,271]
std-199 OMPI_COMM_WORLD_RANK = 1 SLURM_NODEID = 0 SLURM_JOB_NODELIST = std-[199,271]
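As a sanity check, expanding the reported nodelist outside the job shows that the allocation itself names both nodes (scontrol just expands the hostlist expression here):
~/slurm$ scontrol show hostnames "std-[199,271]"
std-199
std-271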
I first saw the problem on a larger cluster, where I needed to specify both -w and -x options to include and exclude nodes, and then narrowed it down to this two-node cluster. I also tried adding options such as --hostfile, --rankfile, and -npernode; none of them changes how the tasks are dispatched to the nodes.
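For example, one of the mpirun variants I tried looks like this (a sketch; only the launch line in mpirun.slm changes, everything else stays the same):
mpirun -npernode 1 ./a.sh
The placement of the two tasks was the same as with the plain mpirun invocation.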
The problem is repeatable.
Here are the tested systems:
Slurm 23.02.5 on Ubuntu 22.04.5 LTS
Slurm 24.05.1 on Ubuntu 22.04.4 LTS
How can I make the last case work, i.e., request -w std-271 and still have the job run across both nodes? I'd appreciate any help!
Regards,
Xinghong