[slurm-users] openmpi / UCX / srun

Hidas, Dean dhidas at bnl.gov
Mon Jun 17 17:40:56 UTC 2019


I am trying to use UCX with Slurm/PMIx and run into the error below.  The following works with mpirun, but what I hoped would be the srun equivalent fails.  Is there some flag or configuration I might be missing for Slurm?

Works fine:
mpirun -n 100 --host apcpu-004:88,apcpu-005:88 --mca pml ucx --mca osc ucx ./hello

Does not work:
srun -n 100 ./hello
slurmstepd: error: apcpu-004 [0] pmixp_dconn_ucx.c:668 [_ucx_connect] mpi/pmix: ERROR: ucp_ep_create failed: Input/output error
slurmstepd: error: apcpu-004 [0] pmixp_dconn.h:243 [pmixp_dconn_connect] mpi/pmix: ERROR: Cannot establish direct connection to apcpu-005 (1)
slurmstepd: error: apcpu-004 [0] pmixp_server.c:731 [_process_extended_hdr] mpi/pmix: ERROR: Unable to connect to 1
slurmstepd: error: *** STEP 50.0 ON apcpu-004 CANCELLED AT 2019-06-17T13:30:11 ***
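
For comparison, the mpirun line passes "--mca pml ucx --mca osc ucx", while the srun line passes no MCA parameters and no explicit MPI plugin. A closer srun counterpart might look like the sketch below; it is untested on this cluster and not a confirmed fix. Open MPI reads MCA parameters from OMPI_MCA_<name> environment variables. SLURM_PMIX_DIRECT_CONN_UCX is a Slurm PMIx plugin setting, and turning it off here is only an assumption about how to isolate the ucp_ep_create failure, which is reported from pmixp_dconn_ucx.c. DRY_RUN is a hypothetical switch added for this sketch so that the script prints the command by default instead of launching it.

```shell
#!/bin/sh
# Sketch: srun counterpart of the working mpirun line (dry run by default).

# Mirror "--mca pml ucx --mca osc ucx" from the mpirun command;
# Open MPI picks up MCA parameters from OMPI_MCA_<name> variables.
export OMPI_MCA_pml=ucx
export OMPI_MCA_osc=ucx

# Assumption: disabling UCX for the PMIx plugin's own direct connections
# helps show whether the failure is limited to that path.
export SLURM_PMIX_DIRECT_CONN_UCX=false

# Same task count and binary as in the post, with the plugin named explicitly.
CMD="srun --mpi=pmix -n 100 ./hello"

# DRY_RUN is hypothetical: 1 (default) prints the command, 0 runs it.
if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "would run: $CMD"
else
    $CMD
fi
```

If the job starts with SLURM_PMIX_DIRECT_CONN_UCX=false, that would point at the plugin's UCX direct-connection path rather than at Open MPI's own UCX transport.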

The configure lines for PMIx, Open MPI, Slurm, and UCX are the following (on Debian 8):
pmix 3.1.2
./configure --prefix=/opt/apps/gcc-7_4/pmix/3.1.2

openmpi 4.0.1
./configure --prefix=/opt/apps/gcc-7_4/openmpi/4.0.1 --with-pmix=/opt/apps/gcc-7_4/pmix/3.1.2 --with-libfabric=/opt/apps/gcc-7_4/libfabric/1.7.2 --with-ucx=/opt/apps/gcc-7_4/ucx/1.5.1 --with-libevent=external --disable-dlopen --without-verbs

slurm 19.05.0
./configure --enable-debug --enable-x11 --with-pmix=/opt/apps/gcc-7_4/pmix/3.1.2 --sysconfdir=/etc/slurm --prefix=/opt/apps/slurm/19.05.0 --with-ucx=/opt/apps/gcc-7_4/ucx/1.5.1

ucx 1.5.1
./configure --enable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/opt/apps/gcc-7_4/ucx/1.5.1
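
Given the builds above, a few read-only checks can show what each piece actually sees at run time. The sketch below is a suggestion under stated assumptions: the bin directories are guessed from the --prefix values and may differ on the real install, and check is a hypothetical helper defined here, not part of Slurm or UCX. "srun --mpi=list" prints the MPI plugin types srun knows about, "ucx_info -v" prints the UCX version and its configure flags, and "ucx_info -d" lists the devices and transports UCX detects on the node.

```shell
#!/bin/sh
# Run a diagnostic command only if the tool is installed; otherwise say so.
check() {
    tool=$1
    shift
    if command -v "$tool" >/dev/null 2>&1; then
        echo "== $tool $*"
        "$tool" "$@"
    else
        echo "skip: $tool not in PATH"
    fi
}

# Assumption: binaries live under <prefix>/bin for the prefixes in the post.
PATH=/opt/apps/gcc-7_4/ucx/1.5.1/bin:/opt/apps/slurm/19.05.0/bin:$PATH
export PATH

check srun --mpi=list   # MPI plugin types available to srun
check ucx_info -v       # UCX version and build configuration
check ucx_info -d       # devices and transports UCX detects on this node
```

Running this on both apcpu-004 and apcpu-005 and comparing the output would show whether the two nodes expose the same transports.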

Any advice is much appreciated.
