Over the past few days I grabbed some time on the nodes and ran for a few hours. It looks like I *can* still hit the issue with cgroups disabled. The incident rate was 8 out of >11k jobs (~0.07%), so it dropped by an order of magnitude or so. I'm guessing that exonerates cgroups as the cause - possibly they're just a good way to tickle the real issue. Over the next few days, I'll try to roll everything back to RHEL 8.9 and see how that goes.
Brent
From: Henderson, Brent via slurm-users [mailto:slurm-users@lists.schedmd.com]
Sent: Thursday, May 2, 2024 11:32 AM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Re: srun launched mpi job occasionally core dumps
Re-tested with slurm 23.02.7 (had to also disable slurmdbd and run the controller with the '-i' option) but still reproduced the issue fairly quickly. Feels like the issue might be some interaction between RHEL 9.3 cgroups and slurm. Not sure what to try next - hoping for some suggestions.
Thanks,
Brent
From: Henderson, Brent via slurm-users [mailto:slurm-users@lists.schedmd.com]
Sent: Wednesday, May 1, 2024 11:21 AM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] srun launched mpi job occasionally core dumps
Greetings Slurm gurus --
I've been having an issue where, very occasionally, an srun-launched OpenMPI job will die during startup inside MPI_Init(). E.g. srun -N 8 --ntasks-per-node=1 ./hello_world_mpi. The same binary launched with mpirun does not experience the issue. E.g. mpirun -n 64 -H cn01,... ./hello_world_mpi. The failure rate seems to be in the 0.5% - 1.0% range when launching with srun.
SW stack is self-built with:
* Dual socket AMD nodes
* RHEL 9.3 base system + tools
* Single 100 Gb card per host
* hwloc 2.9.3
* pmix 4.2.9 (5.0.2 also tried but continued to see the same issues)
* slurm 23.11.6 (started with 23.11.5 - update did not change the behavior)
* openmpi 5.0.3
The MPI code is a simple hello_world_mpi.c, but it does not seem to matter what the test is - anything that goes through MPI startup via srun can hit it (a minimal sketch of the test program is included after the stack trace below). The application core dump looks like the following regardless of which test is running:
[cn04:1194785] *** Process received signal ***
[cn04:1194785] Signal: Segmentation fault (11)
[cn04:1194785] Signal code: Address not mapped (1)
[cn04:1194785] Failing at address: 0xe0
[cn04:1194785] [ 0] /lib64/libc.so.6(+0x54db0)[0x7f54e6254db0]
[cn04:1194785] [ 1] /share/openmpi/5.0.3/lib/libmpi.so.40(mca_pml_ob1_recv_frag_callback_match+0x7d)[0x7f54e67eab3d]
[cn04:1194785] [ 2] /share/openmpi/5.0.3/lib/libopen-pal.so.80(+0xa7d8c)[0x7f54e6566d8c]
[cn04:1194785] [ 3] /lib64/libevent_core-2.1.so.7(+0x21b88)[0x7f54e649cb88]
[cn04:1194785] [ 4] /lib64/libevent_core-2.1.so.7(event_base_loop+0x577)[0x7f54e649e7a7]
[cn04:1194785] [ 5] /share/openmpi/5.0.3/lib/libopen-pal.so.80(+0x222af)[0x7f54e64e12af]
[cn04:1194785] [ 6] /share/openmpi/5.0.3/lib/libopen-pal.so.80(opal_progress+0x85)[0x7f54e64e1365]
[cn04:1194785] [ 7] /share/openmpi/5.0.3/lib/libmpi.so.40(ompi_mpi_init+0x46d)[0x7f54e663ce7d]
[cn04:1194785] [ 8] /share/openmpi/5.0.3/lib/libmpi.so.40(MPI_Init+0x5e)[0x7f54e66711ae]
[cn04:1194785] [ 9] /home/brent/bin/ior-3.0.1/ior[0x403780]
[cn04:1194785] [10] /lib64/libc.so.6(+0x3feb0)[0x7f54e623feb0]
[cn04:1194785] [11] /lib64/libc.so.6(__libc_start_main+0x80)[0x7f54e623ff60]
[cn04:1194785] [12] /home/brent/bin/ior-3.0.1/ior[0x4069d5]
[cn04:1194785] *** End of error message ***
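For reference, hello_world_mpi.c is essentially the canonical MPI hello world - a minimal sketch along these lines (not necessarily my exact source):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);   /* the srun-launched ranks segfault inside this call */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    printf("Hello from rank %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}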
More than one rank can die with the same stack trace on a node when this happens - I've seen as many as 6. One other interesting note is that if I change my srun command line to include strace (e.g. srun -N 8 --ntasks-per-node=8 strace <strace-options> ./hello_world_mpi) the issue appears to go away: 0 failures in ~2500 runs. Another thing that seems to help is disabling cgroups in slurm.conf - after that change I saw 0 failures in >6100 hello_world_mpi runs.
The changes in the slurm.conf were - original:
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
JobAcctGatherType=jobacct_gather/cgroup

Changed to:
ProctrackType=proctrack/linuxproc
TaskPlugin=task/affinity
JobAcctGatherType=jobacct_gather/linux
My cgroup.conf file contains:
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedRamSpace=95
Curious if anyone has any thoughts on next steps to help figure out what might be going on and how to resolve it. Currently I'm planning to back down to the 23.02.7 release and see how that goes, but I'm open to other suggestions.
Thanks,
Brent