[slurm-users] Inconsistent cpu bindings with cpu-bind=none
Donners, John
john.donners at atos.net
Tue Feb 18 21:41:29 UTC 2020
Hi all,
I have a few more remarks about this question (I have been in contact with Marcus about this):
- The idea of the jobscript is that SLURM does not do any binding and leaves
  binding up to mpirun.
- This works fine on the first node, where SLURM does not bind the processes
  (so mpirun can do so).
- On the second node, SLURM applies a (faulty) core binding: all processes are
  bound round-robin to the hyperthreads of the first core. Intel's mpirun
  respects the cpuset, and as a result the processes are bound incorrectly.
This looks like a SLURM issue to me. SLURM version 19.05.5 is used.
A workaround is to use I_MPI_PIN_RESPECT_CPUSET=no.
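A minimal sketch of how this workaround could sit in the jobscript, assuming the exports are placed before the mpirun line and the rest of the script stays unchanged:

```shell
# Ask SLURM not to bind tasks (already present in the original jobscript).
export SLURM_CPU_BIND=none
# Workaround: tell Intel MPI to ignore the cpuset it inherits from slurmd,
# so that mpirun computes the pinning itself on every node.
export I_MPI_PIN_RESPECT_CPUSET=no
```

With I_MPI_DEBUG still set, the "MPI startup()" lines should then show whether each rank on the affected nodes gets its own CPU set again.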
Cheers,
John
Hi everyone,
I am facing a bit of a weird issue with CPU bindings and mpirun:
My jobscript:
#SBATCH -N 20
#SBATCH --tasks-per-node=40
#SBATCH -p medium40
#SBATCH -t 30
#SBATCH -o out/%J.out
#SBATCH -e out/%J.err
#SBATCH --reservation=root_98
module load impi/2019.4 2>&1
export I_MPI_DEBUG=6
export SLURM_CPU_BIND=none
. /sw/comm/impi/compilers_and_libraries_2019.4.243/linux/mpi/intel64/bin/mpivars.sh release
BENCH=/sw/comm/impi/compilers_and_libraries_2019.4.243/linux/mpi/intel64/bin/IMB-MPI1
mpirun -np 800 $BENCH -npmin 800 -iter 50 -time 120 -msglog 16:18 -include Allreduce Bcast Barrier Exchange Gather PingPing PingPong Reduce Scatter Allgather Alltoall Reduce_scatter
My output is as follows:
[...]
[0] MPI startup(): 37 154426 gcn1311 {37,77}
[0] MPI startup(): 38 154427 gcn1311 {38,78}
[0] MPI startup(): 39 154428 gcn1311 {39,79}
[0] MPI startup(): 40 161061 gcn1312 {0}
[0] MPI startup(): 41 161062 gcn1312 {40}
[0] MPI startup(): 42 161063 gcn1312 {0}
[0] MPI startup(): 43 161064 gcn1312 {40}
[0] MPI startup(): 44 161065 gcn1312 {0}
[...]
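To spot the affected nodes without reading all 800 lines, something like the following could be used. It is only a sketch: find_overlapping_pins is a hypothetical helper name, and it assumes the "MPI startup()" lines have exactly the layout shown above (rank, pid, node, CPU set in braces without spaces).

```shell
# Hypothetical helper: read I_MPI_DEBUG output on stdin and print every node
# on which more than one rank reports the same CPU set.
find_overlapping_pins() {
    awk '$2 == "MPI" && $3 == "startup():" && $7 ~ /^[{].*[}]$/ {
             seen[$6 " " $7]++              # count ranks per (node, CPU set)
         }
         END {
             for (key in seen)
                 if (seen[key] > 1) {
                     split(key, parts, " ") # parts[1] is the node name
                     bad[parts[1]] = 1
                 }
             for (node in bad) print node
         }' | sort
}
```

It would be fed the job's output file on stdin, e.g. `find_overlapping_pins < out/JOBID.out`, where JOBID stands for the actual job id.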
On 8 out of 20 nodes I got the wrong pinning. In the slurmd logs I found
that on the nodes where the pinning was correct, the manual binding was
communicated correctly:
lllp_distribution jobid [2065227] manual binding: none
On the nodes where it did not work, slurmd used the default auto binding instead:
lllp_distribution jobid [2065227] default auto binding: cores, dist 1
So, for some reason, Slurm applied CPU binding to the tasks on some nodes,
while on the other nodes the CPU binding was (correctly) disabled.
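The check of the slurmd logs can be sketched as follows. binding_mode is a hypothetical helper name; where the slurmd log lives on each node is site-specific and not shown here, so the function simply reads log text on stdin and assumes the lllp_distribution lines look like the two quoted above.

```shell
# Hypothetical helper: print the distinct binding decisions that slurmd
# logged for one job id, taken from the lllp_distribution lines on stdin.
binding_mode() {
    jobid=$1
    grep -F "lllp_distribution jobid [$jobid]" |
        sed -e 's/.*\] //' |   # keep only the text after "[jobid] "
        sort -u
}
```

Run against the log of a single node, one line of output ("manual binding: none") would indicate the expected behaviour, while "default auto binding: cores, dist 1" would mark a node where the binding request was not honoured.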
Any ideas what could cause this?
Best,
Marcus