Dear Slurm community,
I am confused by the behaviour of a freshly built openmpi-5.0.7 with slurm-24.11.4. I can run a simple hello-world program via mpirun, but with really slow startup: a single process needs 1.6 s, and 384 processes on two 192-core nodes need around half a minute.
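(Those are naive wall-clock numbers for the whole launch, essentially just

$ time mpirun -n 1 mpihello

and the like, nothing more sophisticated.)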
I guess there is a deeper issue to work out with openmpi itself, namely why it needs 1.3 seconds to start a single process even outside a Slurm environment.
So I investigated whether running via srun changes things. The Open MPI docs recommend using mpirun, and that has traditionally been our safe bet, but direct launch via srun is supposed to work, too. I duly built slurm-24.11.4 with
./configure --prefix=/syssw/slurm/24.11.4 \
    --sysconfdir=/syssw/etc/slurm \
    --with-munge=/syssw/munge/0.5.16 \
    --with-hwloc=/syssw/hwloc/2.11.2 \
    --disable-static --with-json \
    --with-pmix=/syssw/pmix/3.2.5:/syssw/pmix/5.0.7 \
    LDFLAGS=-Wl,--disable-new-dtags
thus providing both PMIx versions currently installed on our system.
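For the record, the resulting plugins are visible to srun:

$ srun --mpi=list

lists pmix, pmix_v3 and pmix_v5 among the available MPI plugin types.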
Now I am perplexed to observe that
$ srun -vv --mpi=pmix_v5 -N 1 -n 1 mpihello
does _not_ work, but produces
$ srun -vv -n 1 -N 1 --mpi=pmix_v5 mpihello
srun: defined options
srun: -------------------- --------------------
srun: (null)              : n[164-165]
srun: jobid               : 671133
srun: job-name            : interactive
srun: mpi                 : pmix_v5
srun: nodes               : 1
srun: ntasks              : 1
srun: verbose             : 2
srun: -------------------- --------------------
srun: end of defined options
srun: debug: propagating SLURM_PRIO_PROCESS=0
srun: debug: propagating UMASK=0022
srun: jobid 671133: nodes(2):`n[164-165]', cpu counts: 192(x2)
srun: debug: requesting job 671133, user 99, nodes 1 including ((null))
srun: debug: cpus 1, tasks 1, name mpihello, relative 65534
srun: CpuBindType=none
srun: debug: Entering slurm_step_launch
srun: debug: mpi/pmix_v5: pmixp_abort_agent_start: (null) [0]: pmixp_agent.c:417: Abort agent port: 36505
srun: debug: mpi/pmix_v5: _pmix_abort_thread: (null) [0]: pmixp_agent.c:356: Start abort thread
srun: debug: mpi/pmix_v5: mpi_p_client_prelaunch: (null) [0]: mpi_pmix.c:282: setup process mapping in srun
srun: debug: Entering _msg_thr_create()
srun: debug: initialized stdio listening socket, port 38681
srun: debug: Started IO server thread
srun: debug: Entering _launch_tasks
srun: launching StepId=671133.3 on host n165, 1 tasks: 0
srun: topology/tree: init: topology tree plugin loaded
srun: debug: launch returned msg_rc=0 err=0 type=8001
srun: Node n165, 1 tasks started
[n165:2287661] PMIX ERROR: PMIX_ERR_FILE_OPEN_FAILURE in file gds_shmem2.c at line 1056
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and MPI will try to terminate your MPI job as well)
[n165:2287661] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: Received task exit notification for 1 task of StepId=671133.3 (status=0x0e00).
srun: error: n165: task 0: Exited with exit code 14
srun: debug: task 0 done
srun: debug: IO thread exiting
srun: debug: mpi/pmix_v5: _conn_readable: (null) [0]: pmixp_agent.c:109: false, shutdown
srun: debug: mpi/pmix_v5: _pmix_abort_thread: (null) [0]: pmixp_agent.c:363: Abort thread exit
Observe the line
[n165:2287661] PMIX ERROR: PMIX_ERR_FILE_OPEN_FAILURE in file gds_shmem2.c at line 1056
Does anyone have an idea what that means and what might cause it?
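(If I read the PMIx sources correctly, gds_shmem2 is the shared-memory data store that PMIx 5 uses to publish job data to local clients, so this looks like a failure to open a shared-memory backing file. As a pure diagnostic, that component could presumably be taken out of the picture via the PMIx MCA environment convention, e.g.

$ PMIX_MCA_gds=hash srun --mpi=pmix_v5 -N 1 -n 1 mpihello

though I have not verified that this is the right knob, and it would only sidestep the real problem.)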
And what really confuses me is that the test program _does_ work if I switch from pmix_v5 to pmix_v3:
$ srun -vv -n 1 -N 1 --mpi=pmix_v3 mpihello
srun: defined options
srun: -------------------- --------------------
srun: (null)              : n[164-165]
srun: jobid               : 671133
srun: job-name            : interactive
srun: mpi                 : pmix_v3
srun: nodes               : 1
srun: ntasks              : 1
srun: verbose             : 2
srun: -------------------- --------------------
srun: end of defined options
srun: debug: propagating SLURM_PRIO_PROCESS=0
srun: debug: propagating UMASK=0022
srun: jobid 671133: nodes(2):`n[164-165]', cpu counts: 192(x2)
srun: debug: requesting job 671133, user 99, nodes 1 including ((null))
srun: debug: cpus 1, tasks 1, name mpihello, relative 65534
srun: CpuBindType=none
srun: debug: Entering slurm_step_launch
srun: debug: mpi/pmix_v3: pmixp_abort_agent_start: (null) [0]: pmixp_agent.c:417: Abort agent port: 43737
srun: debug: mpi/pmix_v3: _pmix_abort_thread: (null) [0]: pmixp_agent.c:356: Start abort thread
srun: debug: mpi/pmix_v3: mpi_p_client_prelaunch: (null) [0]: mpi_pmix.c:282: setup process mapping in srun
srun: debug: Entering _msg_thr_create()
srun: debug: initialized stdio listening socket, port 43805
srun: debug: Started IO server thread
srun: debug: Entering _launch_tasks
srun: launching StepId=671133.4 on host n164, 1 tasks: 0
srun: topology/tree: init: topology tree plugin loaded
srun: debug: launch returned msg_rc=0 err=0 type=8001
srun: Node n164, 1 tasks started
hello world from processor n164, rank 0 out of 1
srun: Received task exit notification for 1 task of StepId=671133.4 (status=0x0000).
srun: n164: task 0: Completed
srun: debug: task 0 done
srun: debug: IO thread exiting
srun: debug: mpi/pmix_v3: _conn_readable: (null) [0]: pmixp_agent.c:109: false, shutdown
srun: debug: mpi/pmix_v3: _pmix_abort_thread: (null) [0]: pmixp_agent.c:363: Abort thread exit
How can an MPI built against PMIx 5 even work with the pmix_v3 plugin? And why does it _not_ work with the pmix_v5 plugin? I am also curious why the plugins don't link against their respective libpmix (do they dlopen their dependencies? If so, why?):
$ ldd /syssw/slurm/24.11.4/lib/slurm/mpi_pmix*.so
/syssw/slurm/24.11.4/lib/slurm/mpi_pmix.so:
    linux-vdso.so.1 (0x00007ffd19ffb000)
    libhwloc.so.15 => /syssw/hwloc/2.11.2/lib/libhwloc.so.15 (0x000014d199e56000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000014d199c70000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x000014d199b90000)
    /lib64/ld-linux-x86-64.so.2 (0x000014d199edd000)
/syssw/slurm/24.11.4/lib/slurm/mpi_pmix_v3.so:
    linux-vdso.so.1 (0x00007ffd265f2000)
    libhwloc.so.15 => /syssw/hwloc/2.11.2/lib/libhwloc.so.15 (0x00001553902c8000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00001553900e2000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x0000155390002000)
    /lib64/ld-linux-x86-64.so.2 (0x0000155390350000)
/syssw/slurm/24.11.4/lib/slurm/mpi_pmix_v5.so:
    linux-vdso.so.1 (0x00007ffd862b7000)
    libhwloc.so.15 => /syssw/hwloc/2.11.2/lib/libhwloc.so.15 (0x0000145adc36d000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0000145adc187000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x0000145adc0a7000)
    /lib64/ld-linux-x86-64.so.2 (0x0000145adc3f4000)
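If so, the PMIx entry points must be unresolved dynamic symbols in the plugins; something like

$ nm -D /syssw/slurm/24.11.4/lib/slurm/mpi_pmix_v5.so | grep ' U PMIx'

should show them, meaning libpmix has to be dlopen()ed (with RTLD_GLOBAL?) before the plugin's symbols get resolved.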
But they do have the proper RPATH set up:
$ readelf -d /syssw/slurm/24.11.4/lib/slurm/mpi_pmix*.so | grep -e ^File -e PATH File: /syssw/slurm/24.11.4/lib/slurm/mpi_pmix.so 0x000000000000000f (RPATH) Library rpath: [/syssw/hwloc/2.11.2/lib:/syssw/pmix/5.0.7/lib] File: /syssw/slurm/24.11.4/lib/slurm/mpi_pmix_v3.so 0x000000000000000f (RPATH) Library rpath: [/syssw/hwloc/2.11.2/lib:/syssw/pmix/3.2.5/lib] File: /syssw/slurm/24.11.4/lib/slurm/mpi_pmix_v5.so 0x000000000000000f (RPATH) Library rpath: [/syssw/hwloc/2.11.2/lib:/syssw/pmix/5.0.7/lib]
The RPATH matters, since libpmix doesn't get sensible SONAME versioning (supposing the two versions are meant to be separate ABIs):
$ find /syssw/pmix/* -name 'libpmix.so*'
/syssw/pmix/3.2.5/lib/libpmix.so
/syssw/pmix/3.2.5/lib/libpmix.so.2.2.35
/syssw/pmix/3.2.5/lib/libpmix.so.2
/syssw/pmix/5.0.7/lib/libpmix.so
/syssw/pmix/5.0.7/lib/libpmix.so.2.13.7
/syssw/pmix/5.0.7/lib/libpmix.so.2
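The SONAME can be double-checked the same way as the RPATH above:

$ readelf -d /syssw/pmix/*/lib/libpmix.so.2 | grep -e ^File -e SONAME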
It's all libpmix.so.2. My mpihello program uses the 5.0.7 one, at least:
$ ldd mpihello
    linux-vdso.so.1 (0x00007fff1f13f000)
    libmpi.so.40 => /sw/env/gcc-13.3.0/openmpi/5.0.7/lib/libmpi.so.40 (0x000014ed89d95000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000014ed89baf000)
    libopen-pal.so.80 => /sw/env/gcc-13.3.0/openmpi/5.0.7/lib/libopen-pal.so.80 (0x000014ed89a25000)
    libfabric.so.1 => /syssw/fabric/1.21.0/lib/libfabric.so.1 (0x000014ed8989b000)
    libefa.so.1 => /lib/x86_64-linux-gnu/libefa.so.1 (0x000014ed8988d000)
    libibverbs.so.1 => /lib/x86_64-linux-gnu/libibverbs.so.1 (0x000014ed8986c000)
    libpsm2.so.2 => /syssw/psm2/12.0.1/lib/libpsm2.so.2 (0x000014ed89804000)
    libatomic.so.1 => /sw/compiler/gcc-13.3.0/lib64/libatomic.so.1 (0x000014ed897fb000)
    libnl-route-3.so.200 => /lib/x86_64-linux-gnu/libnl-route-3.so.200 (0x000014ed8976a000)
    libnl-3.so.200 => /lib/x86_64-linux-gnu/libnl-3.so.200 (0x000014ed89745000)
    libpmix.so.2 => /syssw/pmix/5.0.7/lib/libpmix.so.2 (0x000014ed8951e000)
    libevent_core-2.1.so.7 => /lib/x86_64-linux-gnu/libevent_core-2.1.so.7 (0x000014ed894e8000)
    libevent_pthreads-2.1.so.7 => /lib/x86_64-linux-gnu/libevent_pthreads-2.1.so.7 (0x000014ed894e3000)
    libhwloc.so.15 => /syssw/hwloc/2.11.2/lib/libhwloc.so.15 (0x000014ed89486000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x000014ed893a6000)
    /lib64/ld-linux-x86-64.so.2 (0x000014ed8a0d8000)
    libnuma.so.1 => /lib/x86_64-linux-gnu/libnuma.so.1 (0x000014ed89397000)
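(For completeness,

$ ompi_info | grep -i pmix

should also report which PMIx this Open MPI build was configured against, i.e. the external 5.0.7 tree rather than the bundled copy.)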
Can someone shed light on how the differing PMIx plugins are supposed to work? Can someone share a setup where pmix_v5 does work with openmpi 5.x?
Alrighty then,
Thomas