Dear Slurm community,
I am confused by the behaviour of a freshly built openmpi-5.0.7 with
slurm-24.11.4. I can run a simple hello-world program via mpirun, but
startup is really slow: a single process needs 1.6 s, and 384 processes
on two 192-core nodes need around half a minute.
I suspect there is a deeper issue to work out with Open MPI itself,
namely why it needs 1.3 seconds to start a single process even outside
a Slurm environment.
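(For concreteness, the numbers above are naive wall-clock timings,
roughly along these lines, with nothing more sophisticated than time(1)
around the launcher:
$ time mpirun -np 1 mpihello
$ time mpirun -np 384 mpihello   # inside a 2-node, 384-core allocation
)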
So I investigated whether running via srun changes things. The Open MPI
docs recommend using mpirun, and that has traditionally been our safe
bet, but direct launch via srun is supposed to work, too. I duly built
slurm-24.11.4 with
./configure --prefix=/syssw/slurm/24.11.4 \
--sysconfdir=/syssw/etc/slurm \
--with-munge=/syssw/munge/0.5.16 \
--with-hwloc=/syssw/hwloc/2.11.2 \
--disable-static --with-json \
--with-pmix=/syssw/pmix/3.2.5:/syssw/pmix/5.0.7 \
LDFLAGS=-Wl,--disable-new-dtags
providing the two PMIx versions currently installed on our system.
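(As a sanity check, I assume the plugin types the build provides can be
listed directly with
$ srun --mpi=list
which for this build should include pmix, pmix_v3 and pmix_v5.)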
Now I am perplexed to observe that launching with --mpi=pmix_v5 does
_not_ work:
$ srun -vv -n 1 -N 1 --mpi=pmix_v5 mpihello
srun: defined options
srun: -------------------- --------------------
srun: (null) : n[164-165]
srun: jobid : 671133
srun: job-name : interactive
srun: mpi : pmix_v5
srun: nodes : 1
srun: ntasks : 1
srun: verbose : 2
srun: -------------------- --------------------
srun: end of defined options
srun: debug: propagating SLURM_PRIO_PROCESS=0
srun: debug: propagating UMASK=0022
srun: jobid 671133: nodes(2):`n[164-165]', cpu counts: 192(x2)
srun: debug: requesting job 671133, user 99, nodes 1 including ((null))
srun: debug: cpus 1, tasks 1, name mpihello, relative 65534
srun: CpuBindType=none
srun: debug: Entering slurm_step_launch
srun: debug: mpi/pmix_v5: pmixp_abort_agent_start: (null) [0]: pmixp_agent.c:417: Abort agent port: 36505
srun: debug: mpi/pmix_v5: _pmix_abort_thread: (null) [0]: pmixp_agent.c:356: Start abort thread
srun: debug: mpi/pmix_v5: mpi_p_client_prelaunch: (null) [0]: mpi_pmix.c:282: setup process mapping in srun
srun: debug: Entering _msg_thr_create()
srun: debug: initialized stdio listening socket, port 38681
srun: debug: Started IO server thread
srun: debug: Entering _launch_tasks
srun: launching StepId=671133.3 on host n165, 1 tasks: 0
srun: topology/tree: init: topology tree plugin loaded
srun: debug: launch returned msg_rc=0 err=0 type=8001
srun: Node n165, 1 tasks started
[n165:2287661] PMIX ERROR: PMIX_ERR_FILE_OPEN_FAILURE in file gds_shmem2.c at line 1056
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and MPI will try to terminate your MPI job as well)
[n165:2287661] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: Received task exit notification for 1 task of StepId=671133.3 (status=0x0e00).
srun: error: n165: task 0: Exited with exit code 14
srun: debug: task 0 done
srun: debug: IO thread exiting
srun: debug: mpi/pmix_v5: _conn_readable: (null) [0]: pmixp_agent.c:109: false, shutdown
srun: debug: mpi/pmix_v5: _pmix_abort_thread: (null) [0]: pmixp_agent.c:363: Abort thread exit
Observe the line
[n165:2287661] PMIX ERROR: PMIX_ERR_FILE_OPEN_FAILURE in file gds_shmem2.c at line 1056
Does anyone have an idea what that means and what causes it?
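(One thing I have not tried yet: if it is the shmem2 "global data
storage" component that fails to create its backing file, it might be
possible to rule it out by forcing PMIx to the plain hash component via
an MCA environment variable; I assume the knob is named like this:
$ PMIX_MCA_gds=hash srun -vv -n 1 -N 1 --mpi=pmix_v5 mpihello
Perhaps it is also worth checking whether /dev/shm or the job's TMPDIR
is writable on the node.)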
What really confuses me is that the test program _does_ work if I
switch from pmix_v5 to pmix_v3:
$ srun -vv -n 1 -N 1 --mpi=pmix_v3 mpihello
srun: defined options
srun: -------------------- --------------------
srun: (null) : n[164-165]
srun: jobid : 671133
srun: job-name : interactive
srun: mpi : pmix_v3
srun: nodes : 1
srun: ntasks : 1
srun: verbose : 2
srun: -------------------- --------------------
srun: end of defined options
srun: debug: propagating SLURM_PRIO_PROCESS=0
srun: debug: propagating UMASK=0022
srun: jobid 671133: nodes(2):`n[164-165]', cpu counts: 192(x2)
srun: debug: requesting job 671133, user 99, nodes 1 including ((null))
srun: debug: cpus 1, tasks 1, name mpihello, relative 65534
srun: CpuBindType=none
srun: debug: Entering slurm_step_launch
srun: debug: mpi/pmix_v3: pmixp_abort_agent_start: (null) [0]: pmixp_agent.c:417: Abort agent port: 43737
srun: debug: mpi/pmix_v3: _pmix_abort_thread: (null) [0]: pmixp_agent.c:356: Start abort thread
srun: debug: mpi/pmix_v3: mpi_p_client_prelaunch: (null) [0]: mpi_pmix.c:282: setup process mapping in srun
srun: debug: Entering _msg_thr_create()
srun: debug: initialized stdio listening socket, port 43805
srun: debug: Started IO server thread
srun: debug: Entering _launch_tasks
srun: launching StepId=671133.4 on host n164, 1 tasks: 0
srun: topology/tree: init: topology tree plugin loaded
srun: debug: launch returned msg_rc=0 err=0 type=8001
srun: Node n164, 1 tasks started
hello world from processor n164, rank 0 out of 1
srun: Received task exit notification for 1 task of StepId=671133.4 (status=0x0000).
srun: n164: task 0: Completed
srun: debug: task 0 done
srun: debug: IO thread exiting
srun: debug: mpi/pmix_v3: _conn_readable: (null) [0]: pmixp_agent.c:109: false, shutdown
srun: debug: mpi/pmix_v3: _pmix_abort_thread: (null) [0]: pmixp_agent.c:363: Abort thread exit
How can an MPI built against PMIx 5 even work with the pmix_v3 plugin?
Why does it _not_ work with the pmix_v5 plugin? I am also curious why
the plugins don't link against their respective libpmix (do they
dlopen() their dependencies, and if so, why?).
$ ldd /syssw/slurm/24.11.4/lib/slurm/mpi_pmix*.so
/syssw/slurm/24.11.4/lib/slurm/mpi_pmix.so:
linux-vdso.so.1 (0x00007ffd19ffb000)
libhwloc.so.15 => /syssw/hwloc/2.11.2/lib/libhwloc.so.15 (0x000014d199e56000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000014d199c70000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x000014d199b90000)
/lib64/ld-linux-x86-64.so.2 (0x000014d199edd000)
/syssw/slurm/24.11.4/lib/slurm/mpi_pmix_v3.so:
linux-vdso.so.1 (0x00007ffd265f2000)
libhwloc.so.15 => /syssw/hwloc/2.11.2/lib/libhwloc.so.15 (0x00001553902c8000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00001553900e2000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x0000155390002000)
/lib64/ld-linux-x86-64.so.2 (0x0000155390350000)
/syssw/slurm/24.11.4/lib/slurm/mpi_pmix_v5.so:
linux-vdso.so.1 (0x00007ffd862b7000)
libhwloc.so.15 => /syssw/hwloc/2.11.2/lib/libhwloc.so.15 (0x0000145adc36d000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0000145adc187000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x0000145adc0a7000)
/lib64/ld-linux-x86-64.so.2 (0x0000145adc3f4000)
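(If the plugins really dlopen() libpmix at run time, I would expect
either a hard-coded library path embedded in them or PMIx symbols left
unresolved for the loader. My naive probes would be
$ strings /syssw/slurm/24.11.4/lib/slurm/mpi_pmix_v5.so | grep libpmix
$ nm -D /syssw/slurm/24.11.4/lib/slurm/mpi_pmix_v5.so | grep ' U PMIx'
but maybe someone knows the mechanism offhand.)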
But they do have the proper RPATH set up:
$ readelf -d /syssw/slurm/24.11.4/lib/slurm/mpi_pmix*.so | grep -e ^File -e PATH
File: /syssw/slurm/24.11.4/lib/slurm/mpi_pmix.so
0x000000000000000f (RPATH) Library rpath: [/syssw/hwloc/2.11.2/lib:/syssw/pmix/5.0.7/lib]
File: /syssw/slurm/24.11.4/lib/slurm/mpi_pmix_v3.so
0x000000000000000f (RPATH) Library rpath: [/syssw/hwloc/2.11.2/lib:/syssw/pmix/3.2.5/lib]
File: /syssw/slurm/24.11.4/lib/slurm/mpi_pmix_v5.so
0x000000000000000f (RPATH) Library rpath: [/syssw/hwloc/2.11.2/lib:/syssw/pmix/5.0.7/lib]
That matters, since libpmix doesn't get sensible SONAME versioning
(assuming the two major versions are meant to be separate ABIs):
$ find /syssw/pmix/* -name 'libpmix.so*'
/syssw/pmix/3.2.5/lib/libpmix.so
/syssw/pmix/3.2.5/lib/libpmix.so.2.2.35
/syssw/pmix/3.2.5/lib/libpmix.so.2
/syssw/pmix/5.0.7/lib/libpmix.so
/syssw/pmix/5.0.7/lib/libpmix.so.2.13.7
/syssw/pmix/5.0.7/lib/libpmix.so.2
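To be explicit about what the dynamic linker records, the SONAMEs can
be read straight off the libraries:
$ readelf -d /syssw/pmix/3.2.5/lib/libpmix.so.2.2.35 | grep SONAME
$ readelf -d /syssw/pmix/5.0.7/lib/libpmix.so.2.13.7 | grep SONAME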
It's all libpmix.so.2. At least my mpihello program picks up the 5.0.7
one:
$ ldd mpihello
linux-vdso.so.1 (0x00007fff1f13f000)
libmpi.so.40 => /sw/env/gcc-13.3.0/openmpi/5.0.7/lib/libmpi.so.40 (0x000014ed89d95000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000014ed89baf000)
libopen-pal.so.80 => /sw/env/gcc-13.3.0/openmpi/5.0.7/lib/libopen-pal.so.80 (0x000014ed89a25000)
libfabric.so.1 => /syssw/fabric/1.21.0/lib/libfabric.so.1 (0x000014ed8989b000)
libefa.so.1 => /lib/x86_64-linux-gnu/libefa.so.1 (0x000014ed8988d000)
libibverbs.so.1 => /lib/x86_64-linux-gnu/libibverbs.so.1 (0x000014ed8986c000)
libpsm2.so.2 => /syssw/psm2/12.0.1/lib/libpsm2.so.2 (0x000014ed89804000)
libatomic.so.1 => /sw/compiler/gcc-13.3.0/lib64/libatomic.so.1 (0x000014ed897fb000)
libnl-route-3.so.200 => /lib/x86_64-linux-gnu/libnl-route-3.so.200 (0x000014ed8976a000)
libnl-3.so.200 => /lib/x86_64-linux-gnu/libnl-3.so.200 (0x000014ed89745000)
libpmix.so.2 => /syssw/pmix/5.0.7/lib/libpmix.so.2 (0x000014ed8951e000)
libevent_core-2.1.so.7 => /lib/x86_64-linux-gnu/libevent_core-2.1.so.7 (0x000014ed894e8000)
libevent_pthreads-2.1.so.7 => /lib/x86_64-linux-gnu/libevent_pthreads-2.1.so.7 (0x000014ed894e3000)
libhwloc.so.15 => /syssw/hwloc/2.11.2/lib/libhwloc.so.15 (0x000014ed89486000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x000014ed893a6000)
/lib64/ld-linux-x86-64.so.2 (0x000014ed8a0d8000)
libnuma.so.1 => /lib/x86_64-linux-gnu/libnuma.so.1 (0x000014ed89397000)
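Since ldd only shows the link-time view, I suppose the libraries
actually mapped at run time under srun could be double-checked with the
glibc loader's debug output (assuming glibc; the trace goes to stderr):
$ srun -n 1 -N 1 --mpi=pmix_v3 env LD_DEBUG=libs mpihello 2>&1 | grep -i pmix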
Can someone shed light on how the differing PMIx plugins are supposed
to work? Can someone share a setup where pmix_v5 does work with Open
MPI 5.x?
Alrighty then,
Thomas
--
Dr. Thomas Orgis
HPC @ Universität Hamburg