Greetings Slurm gurus --
I've been having an issue where, very occasionally, an OpenMPI job launched with srun will die during startup within MPI_Init(). E.g. srun -N 8 --ntasks-per-node=1 ./hello_world_mpi. The same binary launched with mpirun does not experience the issue. E.g. mpirun -n 64 -H cn01,... ./hello_world_mpi. The failure rate seems to be in the 0.5% - 1.0% range when using srun for launch.
System setup (SW stack is self-built):
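A failure rate like the one quoted above can be measured with a simple launch loop. The following is only a minimal sketch, not the script actually used for these runs: the helper name count_failures is hypothetical, and it assumes the launcher exits non-zero whenever any rank dies.

```python
import subprocess


def count_failures(cmd, runs):
    """Run cmd `runs` times and return how many runs exited non-zero."""
    failures = 0
    for _ in range(runs):
        # Assumption: the launcher (srun here) returns a non-zero exit
        # code when any rank crashes, so the return code alone separates
        # good runs from bad ones. Output is discarded to keep logs small.
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        if result.returncode != 0:
            failures += 1
    return failures
```

Inside an allocation, calling count_failures(["srun", "-N", "8", "--ntasks-per-node=1", "./hello_world_mpi"], 1000) would count the failed launches out of 1000; the run count is an arbitrary choice.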
* Dual socket AMD nodes
* RHEL 9.3 base system + tools
* Single 100 Gb card per host
* hwloc 2.9.3
* pmix 4.2.9 (5.0.2 also tried but continued to see the same issues)
* slurm 23.11.6 (started with 23.11.5 - update did not change the behavior)
* openmpi 5.0.3
The MPI code is a simple hello_world_mpi.c, but the application does not seem to matter - anything that goes through MPI startup via srun can hit it. The application core dump looks like the following regardless of the test running:
[cn04:1194785] *** Process received signal ***
[cn04:1194785] Signal: Segmentation fault (11)
[cn04:1194785] Signal code: Address not mapped (1)
[cn04:1194785] Failing at address: 0xe0
[cn04:1194785] [ 0] /lib64/libc.so.6(+0x54db0)[0x7f54e6254db0]
[cn04:1194785] [ 1] /share/openmpi/5.0.3/lib/libmpi.so.40(mca_pml_ob1_recv_frag_callback_match+0x7d)[0x7f54e67eab3d]
[cn04:1194785] [ 2] /share/openmpi/5.0.3/lib/libopen-pal.so.80(+0xa7d8c)[0x7f54e6566d8c]
[cn04:1194785] [ 3] /lib64/libevent_core-2.1.so.7(+0x21b88)[0x7f54e649cb88]
[cn04:1194785] [ 4] /lib64/libevent_core-2.1.so.7(event_base_loop+0x577)[0x7f54e649e7a7]
[cn04:1194785] [ 5] /share/openmpi/5.0.3/lib/libopen-pal.so.80(+0x222af)[0x7f54e64e12af]
[cn04:1194785] [ 6] /share/openmpi/5.0.3/lib/libopen-pal.so.80(opal_progress+0x85)[0x7f54e64e1365]
[cn04:1194785] [ 7] /share/openmpi/5.0.3/lib/libmpi.so.40(ompi_mpi_init+0x46d)[0x7f54e663ce7d]
[cn04:1194785] [ 8] /share/openmpi/5.0.3/lib/libmpi.so.40(MPI_Init+0x5e)[0x7f54e66711ae]
[cn04:1194785] [ 9] /home/brent/bin/ior-3.0.1/ior[0x403780]
[cn04:1194785] [10] /lib64/libc.so.6(+0x3feb0)[0x7f54e623feb0]
[cn04:1194785] [11] /lib64/libc.so.6(__libc_start_main+0x80)[0x7f54e623ff60]
[cn04:1194785] [12] /home/brent/bin/ior-3.0.1/ior[0x4069d5]
[cn04:1194785] *** End of error message ***
More than one rank can die with the same stacktrace on a node when this happens - I've seen as many as 6. One other interesting note is that if I change my srun command line to include strace (e.g. srun -N 8 --ntasks-per-node=8 strace <strace-options> ./hello_world_mpi) the issue appears to go away: 0 failures in ~2500 runs. Another thing that seems to help is disabling cgroups in the slurm.conf. After that change, I saw 0 failures in >6100 hello_world_mpi runs.
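To judge whether zero failures in that many runs could just be luck, the numbers above can be checked with a little arithmetic. This is a rough sketch that assumes every launch is independent and fails at a fixed rate; it uses the "rule of three" approximation for the upper bound, and both function names are made up for illustration.

```python
def prob_zero_failures(rate, runs):
    """Chance of seeing no failures in `runs` independent launches."""
    return (1.0 - rate) ** runs


def rule_of_three_upper(runs):
    """Approximate 95% upper bound on the rate after 0 failures in `runs`."""
    return 3.0 / runs


# Lower end of the failure range seen with plain srun (0.5%).
baseline = 0.005
# Run counts are the approximate figures quoted above.
for label, runs in (("strace", 2500), ("cgroups disabled", 6100)):
    p0 = prob_zero_failures(baseline, runs)
    upper = rule_of_three_upper(runs)
    print(f"{label}: P(0 failures) = {p0:.1e}, rate below ~{100 * upper:.2f}%")
```

If the true rate were still 0.5%, a clean streak of this length would be very unlikely, so both changes plausibly lower the rate. A clean streak alone cannot show that the rate is exactly zero.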
The changes in the slurm.conf were:
Original:
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup,task/affinity
    JobAcctGatherType=jobacct_gather/cgroup
Changed:
    ProctrackType=proctrack/linuxproc
    TaskPlugin=task/affinity
    JobAcctGatherType=jobacct_gather/linux
My cgroup.conf file contains:
    ConstrainCores=yes
    ConstrainDevices=yes
    ConstrainRAMSpace=yes
    ConstrainSwapSpace=yes
    AllowedRamSpace=95
Curious if anyone has any thoughts on next steps to help figure out what might be going on and how to resolve it. Currently, I'm planning to back down to the 23.02.7 release and see how that goes, but I'm open to other suggestions.
Thanks,
Brent
Re-tested with slurm 23.02.7 (had to also disable slurmdbd and run the controller with the '-i' option) but still reproduced the issue fairly quickly. Feels like the issue might be some interaction between RHEL 9.3 cgroups and slurm. Not sure what to try next - hoping for some suggestions.
Thanks,
Brent
From: Henderson, Brent via slurm-users [mailto:slurm-users@lists.schedmd.com] Sent: Wednesday, May 1, 2024 11:21 AM To: slurm-users@lists.schedmd.com Subject: [slurm-users] srun launched mpi job occasionally core dumps
Over the past few days I grabbed some time on the nodes and ran for a few hours. Looks like I *can* still hit the issue with cgroups disabled. The incident rate was 8 out of >11k jobs, so it dropped by an order of magnitude or so. Guessing that exonerates cgroups as the cause; enabling them is possibly just a good way to tickle the real issue. Over the next few days, I'll try to roll everything back to RHEL 8.9 and see how that goes.
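For a sense of how precisely 8 failures in >11k jobs pins down the rate, a confidence interval helps. The sketch below uses the Wilson score interval; it assumes independent runs and plugs in 11000 as a stand-in for the ">11k" figure, and the function name is chosen only for illustration.

```python
import math


def wilson_interval(failures, runs, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    p_hat = failures / runs
    denom = 1.0 + z * z / runs
    center = (p_hat + z * z / (2.0 * runs)) / denom
    half = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / runs + z * z / (4.0 * runs * runs))
    return center - half, center + half


# 8 failures; 11000 is an assumed stand-in for the ">11k" jobs.
low, high = wilson_interval(8, 11000)
print(f"failure rate between {100 * low:.3f}% and {100 * high:.3f}%")
```

With these inputs the upper end of the interval stays well under the 0.5% seen with cgroups enabled, which fits the reading that disabling cgroups lowers the rate without removing the underlying problem.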
Brent
From: Henderson, Brent via slurm-users [mailto:slurm-users@lists.schedmd.com] Sent: Thursday, May 2, 2024 11:32 AM To: slurm-users@lists.schedmd.com Subject: [slurm-users] Re: srun launched mpi job occasionally core dumps
On 5/7/24 15:32, Henderson, Brent via slurm-users wrote:
Over the past few days I grabbed some time on the nodes and ran for a few hours. Looks like I **can** still hit the issue with cgroups disabled. Incident rate was 8 out of >11k jobs so dropped an order of magnitude or so. Guessing that exonerates cgroups as the cause, but possibly just a good way to tickle the real issue. Over the next few days, I’ll try to roll everything back to RHEL 8.9 and see how that goes.
My 2 cents: RHEL/AlmaLinux/RockyLinux 9.4 is out now, maybe it's worth a try to update to 9.4?
/Ole
Thanks for the suggestion Ole - I tried this out yesterday with RHEL 9.4 with two slightly different setups.
1) Using the stock ice driver that comes with RHEL 9.4 for the card, I still saw the issue.
2) There was not a pre-built version of the ice driver on the Intel download site, so I built it myself, rebooted and re-ran the test. It did greatly reduce the number of occurrences of the issue - but didn't eliminate them.
This is similar to what I saw on the RHEL 9.3 setup (adding the Intel ice driver reduced occurrences but did not eliminate them entirely).
I can also report that the 23.02.7 tree had similar results on the 9.3 node setup. Going backwards on the slurm bits did not seem to change the number of occurrences.
Unfortunately I think I'm out of time for experiments on these nodes, but maybe this thread will be useful to others down the road.
Brent
PS - sorry for my last post getting tagged as a new issue. Hopefully this one will thread correctly.