<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type content="text/html; charset=us-ascii"><meta name=Generator content="Microsoft Word 15 (filtered medium)"><style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:Consolas;
panose-1:2 11 6 9 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
font-size:11.0pt;
font-family:"Calibri",sans-serif;
mso-fareast-language:EN-US;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:612.0pt 792.0pt;
margin:70.85pt 70.85pt 2.0cm 70.85pt;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--></head><body lang=DE link="#0563C1" vlink="#954F72"><div class=WordSection1><p class=MsoNormal>Hello Prentice, <o:p></o:p></p><p class=MsoNormal>Sorry about that. <o:p></o:p></p><p class=MsoNormal>My message refers to the post by Dean Hidas from Mon Jun 17 17:40:56 UTC 2019:<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>> Hello,<o:p></o:p></p><p class=MsoNormal>> <o:p></o:p></p><p class=MsoNormal>> I am trying to use ucx with slurm/pmix and run into the error below. The following works using mpirun, but what I was hoping was the srun equivalent fails. Is there some flag or configuration I might be missing for slurm?<o:p></o:p></p><p class=MsoNormal>> <o:p></o:p></p><p class=MsoNormal>> Works fine:<o:p></o:p></p><p class=MsoNormal>> mpirun -n 100 --host apcpu-004:88,apcpu-005:88 --mca pml ucx --mca osc ucx ./hello<o:p></o:p></p><p class=MsoNormal>> <o:p></o:p></p><p class=MsoNormal>> does not work:<o:p></o:p></p><p class=MsoNormal>> srun -n 100 ./hello<o:p></o:p></p><p class=MsoNormal>> slurmstepd: error: apcpu-004 [0] pmixp_dconn_ucx.c:668 [_ucx_connect] mpi/pmix: ERROR: ucp_ep_create failed: Input/output error<o:p></o:p></p><p class=MsoNormal>> slurmstepd: error: apcpu-004 [0] pmixp_dconn.h:243 [pmixp_dconn_connect] mpi/pmix: ERROR: Cannot establish direct connection to apcpu-005 (1)<o:p></o:p></p><p class=MsoNormal>> slurmstepd: error: apcpu-004 [0] pmixp_server.c:731 [_process_extended_hdr] mpi/pmix: ERROR: Unable to connect to 1<o:p></o:p></p><p class=MsoNormal>> slurmstepd: error: *** STEP 50.0 ON apcpu-004 CANCELLED AT 2019-06-17T13:30:11 ***<o:p></o:p></p><p class=MsoNormal>> <o:p></o:p></p><p class=MsoNormal>> The configurations for pmix, openmpi, slurm, ucx are the following (on Debian 8):<o:p></o:p></p><p class=MsoNormal>> pmix 3.1.2<o:p></o:p></p><p class=MsoNormal>> ./configure --prefix=/opt/apps/gcc-7_4/pmix/3.1.2<o:p></o:p></p><p class=MsoNormal>> <o:p></o:p></p><p class=MsoNormal>> openmpi 
4.0.1<o:p></o:p></p><p class=MsoNormal>> ./configure --prefix=/opt/apps/gcc-7_4/openmpi/4.0.1 --with-pmix=/opt/apps/gcc-7_4/pmix/3.1.2 --with-libfabric=/opt/apps/gcc-7_4/libfabric/1.7.2 --with-ucx=/opt/apps/gcc-7_4/ucx/1.5.1 --with-libevent=external --disable-dlopen --without-verbs<o:p></o:p></p><p class=MsoNormal>> <o:p></o:p></p><p class=MsoNormal>> slurm 19.05.0<o:p></o:p></p><p class=MsoNormal>> ./configure --enable-debug --enable-x11 --with-pmix=/opt/apps/gcc-7_4/pmix/3.1.2 --sysconfdir=/etc/slurm --prefix=/opt/apps/slurm/19.05.0 --with-ucx=/opt/apps/gcc-7_4/ucx/1.5.1<o:p></o:p></p><p class=MsoNormal>> <o:p></o:p></p><p class=MsoNormal>> ucx 1.5.1<o:p></o:p></p><p class=MsoNormal>> ./configure --enable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/opt/apps/gcc-7_4/ucx/1.5.1<o:p></o:p></p><p class=MsoNormal>> <o:p></o:p></p><p class=MsoNormal>> Any advice is much appreciated.<o:p></o:p></p><p class=MsoNormal>> <o:p></o:p></p><p class=MsoNormal>> Best,<o:p></o:p></p><p class=MsoNormal>> <o:p></o:p></p><p class=MsoNormal>> -Dean<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal> <o:p></o:p></p><p class=MsoNormal>>> Max, <o:p></o:p></p><p class=MsoNormal>>> You didn't quote the original e-mail so I'm not sure what the original problem was, or who "you" is. <o:p></o:p></p><p class=MsoNormal>>> --<o:p></o:p></p><p class=MsoNormal>>> Prentice<o:p></o:p></p><p class=MsoNormal>>> On 8/12/20 6:55 AM, Max Quast wrote:<o:p></o:p></p><p class=MsoNormal>>> I am also trying to use ucx with slurm/PMIx and get the same error. 
Also mpirun with "--mca pml ucx" works fine.<o:p></o:p></p><p class=MsoNormal>>> <o:p></o:p></p><p class=MsoNormal>>> Used versions: <o:p></o:p></p><p class=MsoNormal>>> Ubuntu 20.04<o:p></o:p></p><p class=MsoNormal>>> slurm 20.02.4<o:p></o:p></p><p class=MsoNormal>>> OMPI 4.0.4<o:p></o:p></p><p class=MsoNormal>>> PMIx 3.1.5<o:p></o:p></p><p class=MsoNormal>>> UCX 1.9.0-rc1<o:p></o:p></p><p class=MsoNormal>>> OFED 4.9<o:p></o:p></p><p class=MsoNormal>>> <o:p></o:p></p><p class=MsoNormal>>> With ucx 1.8.1 I got a slightly different error:<o:p></o:p></p><p class=MsoNormal>>> error: host1 [0] pmixp_dconn_ucx.c:245 [pmixp_dconn_ucx_prepare] mpi/pmix: ERROR: Fail to init UCX: Unsupported operation<o:p></o:p></p><p class=MsoNormal>>> [2020-08-11T20:24:48.117] [2.0] error: host1 [0] pmixp_dconn.c:72 [pmixp_dconn_init] mpi/pmix: ERROR: Cannot get polling fd<o:p></o:p></p><p class=MsoNormal>>> [2020-08-11T20:24:48.117] [2.0] error: host1 [0] pmixp_server.c:402 [pmixp_stepd_init] mpi/pmix: ERROR: pmixp_dconn_init() failed<o:p></o:p></p><p class=MsoNormal>>> [2020-08-11T20:24:48.117] [2.0] error: (null) [0] mpi_pmix.c:161 [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed<o:p></o:p></p><p class=MsoNormal>>> [2020-08-11T20:24:48.119] [2.0] error: Failed mpi_hook_slurmstepd_prefork<o:p></o:p></p><p class=MsoNormal>>> [2020-08-11T20:24:48.121] [2.0] error: job_manager exiting abnormally, rc = -1<o:p></o:p></p><p class=MsoNormal>>> <o:p></o:p></p><p class=MsoNormal>>> Did you solve the problem?<o:p></o:p></p><p class=MsoNormal>>> <o:p></o:p></p><p class=MsoNormal>>> <o:p></o:p></p><p class=MsoNormal>>> Greetings,<o:p></o:p></p><p class=MsoNormal>>> Max<o:p></o:p></p></div></body></html>