srun launched mpi job occasionally core dumps - slurm-users

1 May 2024


Greetings Slurm gurus --
I've been having an issue where, very occasionally, an srun-launched OpenMPI job dies during startup within MPI_Init(), e.g. srun -N 8 --ntasks-per-node=1 ./hello_world_mpi.  The same binary launched with mpirun does not hit the issue, e.g. mpirun -n 64 -H cn01,... ./hello_world_mpi.  The failure rate seems to be in the 0.5% - 1.0% range when using srun for launch.
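For anyone who wants to reproduce the failure-rate measurement, here is a minimal sketch of a repeat-launch harness. The function names (run_trials, failure_rate) are made up for illustration, and it assumes srun exits nonzero when a task dies on a signal; adjust if your site wraps srun differently.

```python
import subprocess


def run_trials(cmd, trials):
    """Run cmd `trials` times and return the list of exit codes."""
    codes = []
    for _ in range(trials):
        # Capture output so the caller can save a crash trace if desired.
        result = subprocess.run(cmd, capture_output=True, text=True)
        codes.append(result.returncode)
    return codes


def failure_rate(codes):
    """Fraction of runs with a nonzero exit code (0.0 for an empty list)."""
    if not codes:
        return 0.0
    failures = sum(1 for code in codes if code != 0)
    return failures / len(codes)
```

Something like failure_rate(run_trials(["srun", "-N", "8", "--ntasks-per-node=1", "./hello_world_mpi"], 1000)) would then give the observed rate for one configuration. At a 0.5% - 1.0% rate, several thousand runs are needed before "0 failures" means much.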
SW stack is self-built with:
* Dual socket AMD nodes
* RHEL 9.3 base system + tools
* Single 100 Gb card per host
* hwloc 2.9.3
* pmix 4.2.9 (5.0.2 also tried but continued to see the same issues)
* slurm 23.11.6 (started with 23.11.5 - update did not change the behavior)
* openmpi 5.0.3
The MPI code is a simple hello_world_mpi.c, but the application does not seem to matter: anything that goes through MPI startup via srun can hit it.  The application core dump looks like the following regardless of which test is running:
[cn04:1194785] *** Process received signal ***
[cn04:1194785] Signal: Segmentation fault (11)
[cn04:1194785] Signal code: Address not mapped (1)
[cn04:1194785] Failing at address: 0xe0
[cn04:1194785] [ 0] /lib64/libc.so.6(+0x54db0)[0x7f54e6254db0]
[cn04:1194785] [ 1] /share/openmpi/5.0.3/lib/libmpi.so.40(mca_pml_ob1_recv_frag_callback_match+0x7d)[0x7f54e67eab3d]
[cn04:1194785] [ 2] /share/openmpi/5.0.3/lib/libopen-pal.so.80(+0xa7d8c)[0x7f54e6566d8c]
[cn04:1194785] [ 3] /lib64/libevent_core-2.1.so.7(+0x21b88)[0x7f54e649cb88]
[cn04:1194785] [ 4] /lib64/libevent_core-2.1.so.7(event_base_loop+0x577)[0x7f54e649e7a7]
[cn04:1194785] [ 5] /share/openmpi/5.0.3/lib/libopen-pal.so.80(+0x222af)[0x7f54e64e12af]
[cn04:1194785] [ 6] /share/openmpi/5.0.3/lib/libopen-pal.so.80(opal_progress+0x85)[0x7f54e64e1365]
[cn04:1194785] [ 7] /share/openmpi/5.0.3/lib/libmpi.so.40(ompi_mpi_init+0x46d)[0x7f54e663ce7d]
[cn04:1194785] [ 8] /share/openmpi/5.0.3/lib/libmpi.so.40(MPI_Init+0x5e)[0x7f54e66711ae]
[cn04:1194785] [ 9] /home/brent/bin/ior-3.0.1/ior[0x403780]
[cn04:1194785] [10] /lib64/libc.so.6(+0x3feb0)[0x7f54e623feb0]
[cn04:1194785] [11] /lib64/libc.so.6(__libc_start_main+0x80)[0x7f54e623ff60]
[cn04:1194785] [12] /home/brent/bin/ior-3.0.1/ior[0x4069d5]
[cn04:1194785] *** End of error message ***
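When many of these traces pile up across runs, a small script can tally how many processes crashed on each node. This is only a sketch: it assumes each line of the signal-handler output keeps the "[host:pid] " prefix shown above and is not interleaved mid-line with other output, and the function name is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as "[cn04:1194785] Signal: Segmentation fault (11)".
PREFIX = re.compile(r"^\[(?P<host>[^:\]]+):(?P<pid>\d+)\] (?P<body>.*)$")


def crashed_ranks_per_host(log_text):
    """Count distinct crashing PIDs per host in signal-handler output."""
    crashed = set()
    for line in log_text.splitlines():
        match = PREFIX.match(line.strip())
        if match and match.group("body").startswith("*** Process received signal ***"):
            # A set keeps each (host, pid) pair once even if the log repeats.
            crashed.add((match.group("host"), match.group("pid")))
    return Counter(host for host, _pid in crashed)
```

Feeding it the captured stderr of a failed run returns a Counter keyed by hostname, which makes it easy to see whether failures cluster on particular nodes.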
More than one rank can die with the same stacktrace on a node when this happens - I've seen as many as 6.  One other interesting note is that if I change my srun command line to include strace (e.g. srun -N 8 --ntasks-per-node=8 strace <strace-options> ./hello_world_mpi) the issue appears to go away: 0 failures in ~2500 runs.  Another thing that seems to help is disabling cgroups in slurm.conf.  After that change, I saw 0 failures in >6100 hello_world_mpi runs.
The changes in slurm.conf were as follows.  Original:
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
JobAcctGatherType=jobacct_gather/cgroup
Changed to:
ProctrackType=proctrack/linuxproc
TaskPlugin=task/affinity
JobAcctGatherType=jobacct_gather/linux
My cgroup.conf file contains:
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedRamSpace=95
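Since the cgroup plugins appear to be involved, one low-cost check is to have each task report which cgroup it landed in under both configurations. The sketch below parses the documented /proc/<pid>/cgroup format (hierarchy-ID:controller-list:cgroup-path); the function names are hypothetical, and what the paths look like on these nodes is not something I can vouch for.

```python
def parse_proc_cgroup(text):
    """Parse /proc/<pid>/cgroup content into (id, controllers, path) tuples."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split at most twice, since the cgroup path itself may contain ':'.
        hierarchy_id, controllers, path = line.split(":", 2)
        controller_list = controllers.split(",") if controllers else []
        entries.append((int(hierarchy_id), controller_list, path))
    return entries


def read_own_cgroup(proc_path="/proc/self/cgroup"):
    """Return the parsed cgroup membership of the calling process."""
    with open(proc_path) as handle:
        return parse_proc_cgroup(handle.read())
```

Printing read_own_cgroup() from a script launched under srun, once with proctrack/cgroup and once with proctrack/linuxproc, would show whether the tasks end up in a different cgroup. On a cgroup v2 system the file has a single line with hierarchy ID 0 and an empty controller list.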
Curious if anyone has any thoughts on next steps to help figure out what might be going on and how to resolve it.  Currently, I'm planning to back down to the 23.02.7 release and see how that goes, but I'm open to other suggestions.
Thanks,
Brent