Dear Community,
I'm seeing strange behavior from sbatch with different --nodelist (-w) options on my two-node cluster. Here are my test scripts:
~/slurm$ cat mpirun.slm
#!/bin/bash
#SBATCH --job-name=mpirun_2x1
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive
source /usr/mpi/gcc/openmpi-4.1.7a1/bin/mpivars.sh
mpirun ./a.sh
~/slurm$ cat a.sh
#!/bin/bash
echo "`uname -n` OMPI_COMM_WORLD_RANK = $OMPI_COMM_WORLD_RANK SLURM_NODEID = $SLURM_NODEID SLURM_JOB_NODELIST = $SLURM_JOB_NODELIST"
If I do not specify any -w option, or if I include both nodes in the -w option, I get the expected results.
~/slurm$ sbatch mpirun.slm
Submitted batch job 71
~/slurm$ cat slurm-71.out
std-199 OMPI_COMM_WORLD_RANK = 0 SLURM_NODEID = 0 SLURM_JOB_NODELIST = std-[199,271]
std-271 OMPI_COMM_WORLD_RANK = 1 SLURM_NODEID = 0 SLURM_JOB_NODELIST = std-[199,271]
However, if I specify only one node in the -w option while still requesting two nodes, I always get one expected result and one unexpected result, depending on which node I name: the unexpected run dispatches both MPI tasks to the same node.
This one is expected - the two MPI tasks run across both nodes.
~/slurm$ sbatch -w std-199 mpirun.slm
Submitted batch job 72
~/slurm$ cat slurm-72.out
std-199 OMPI_COMM_WORLD_RANK = 0 SLURM_NODEID = 0 SLURM_JOB_NODELIST = std-[199,271]
std-271 OMPI_COMM_WORLD_RANK = 1 SLURM_NODEID = 0 SLURM_JOB_NODELIST = std-[199,271]

This one is unexpected - it ends up running both MPI tasks on the same node, even though SLURM_JOB_NODELIST still shows the correct two nodes.
~/slurm$ sbatch -w std-271 mpirun.slm
Submitted batch job 73
ubuntu@bright-anchovy-controller:~/slurm$ cat slurm-73.out
std-199 OMPI_COMM_WORLD_RANK = 0 SLURM_NODEID = 0 SLURM_JOB_NODELIST = std-[199,271]
std-199 OMPI_COMM_WORLD_RANK = 1 SLURM_NODEID = 0 SLURM_JOB_NODELIST = std-[199,271]
I first saw the problem on a larger cluster, where I needed to combine the -w and -x options to include and exclude nodes. I then narrowed it down to this two-node cluster. I tried adding options such as --hostfile, --rankfile, and -npernode, but none of them changes how the tasks are dispatched to the nodes.
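For reference, the --hostfile attempt looked roughly like this (a sketch; hosts.txt and its contents are just what I used on this two-node cluster):

~/slurm$ cat hosts.txt
std-199 slots=1
std-271 slots=1

# launch line in mpirun.slm, replacing the plain mpirun
mpirun --hostfile hosts.txt -npernode 1 ./a.sh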
The problem is repeatable.
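For comparison, launching the same a.sh with srun instead of mpirun would let Slurm place the tasks directly, which should show whether Slurm's own placement is also affected. A sketch (srun.slm and the job name are just placeholders):

~/slurm$ cat srun.slm
#!/bin/bash
#SBATCH --job-name=srun_2x1
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive
# Slurm places the tasks itself; a.sh will still report SLURM_NODEID and SLURM_JOB_NODELIST
srun ./a.sh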
Here are the tested systems:
- Slurm 23.02.5 on Ubuntu 22.04.5 LTS
- Slurm 24.05.1 on Ubuntu 22.04.4 LTS
How can I make the last case work, i.e., request -w std-271 and still have the job run on both nodes? I'd appreciate any help!
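One direction I wondered about (I don't know whether it is the right workaround) is expanding the allocation's node list myself and handing it to mpirun explicitly. A sketch, where hosts.$SLURM_JOB_ID is just a scratch file name:

# inside mpirun.slm, replacing the plain mpirun line
scontrol show hostnames "$SLURM_JOB_NODELIST" > hosts.$SLURM_JOB_ID
mpirun --hostfile hosts.$SLURM_JOB_ID -npernode 1 ./a.sh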
Regards,
Xinghong