[slurm-users] [External] Re: openmpi / UCX / srun
Max Quast
max at quast.de
Wed Aug 12 19:25:20 UTC 2020
  
Hello Prentice,
Sorry about that.
My post refers to a message that Dean Hidas sent on Mon Jun 17 17:40:56 UTC 2019:
 
> Hello,
> 
> I am trying to use ucx with slurm/pmix and run into the error below.  The
> following works using mpirun, but what I hoped would be the srun equivalent
> fails.  Is there some flag or configuration I might be missing for slurm?
> 
> Works fine:
> mpirun -n 100 --host apcpu-004:88,apcpu-005:88 --mca pml ucx --mca osc ucx ./hello
> 
> does not work:
> srun -n 100 ./hello
> slurmstepd: error: apcpu-004 [0] pmixp_dconn_ucx.c:668 [_ucx_connect] mpi/pmix: ERROR: ucp_ep_create failed: Input/output error
> slurmstepd: error: apcpu-004 [0] pmixp_dconn.h:243 [pmixp_dconn_connect] mpi/pmix: ERROR: Cannot establish direct connection to apcpu-005 (1)
> slurmstepd: error: apcpu-004 [0] pmixp_server.c:731 [_process_extended_hdr] mpi/pmix: ERROR: Unable to connect to 1
> slurmstepd: error: *** STEP 50.0 ON apcpu-004 CANCELLED AT 2019-06-17T13:30:11 ***
> 
> The configurations for pmix, openmpi, slurm, ucx are the following (on Debian 8):
> pmix 3.1.2
> ./configure --prefix=/opt/apps/gcc-7_4/pmix/3.1.2
> 
> openmpi 4.0.1
> ./configure --prefix=/opt/apps/gcc-7_4/openmpi/4.0.1 --with-pmix=/opt/apps/gcc-7_4/pmix/3.1.2 --with-libfabric=/opt/apps/gcc-7_4/libfabric/1.7.2 --with-ucx=/opt/apps/gcc-7_4/ucx/1.5.1 --with-libevent=external --disable-dlopen --without-verbs
> 
> slurm 19.05.0
> ./configure --enable-debug --enable-x11 --with-pmix=/opt/apps/gcc-7_4/pmix/3.1.2 --sysconfdir=/etc/slurm --prefix=/opt/apps/slurm/19.05.0 --with-ucx=/opt/apps/gcc-7_4/ucx/1.5.1
> 
> ucx 1.5.1
> ./configure --enable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/opt/apps/gcc-7_4/ucx/1.5.1
> 
> Any advice is much appreciated.
> 
> Best,
> 
> -Dean
 
>> Max, 
>> You didn't quote the original e-mail so I'm not sure what the original
>> problem was, or who "you" is.
>> --
>> Prentice
>> On 8/12/20 6:55 AM, Max Quast wrote:
>> I am also trying to use ucx with slurm/PMIx and get the same error.  Also,
>> mpirun with "--mca pml ucx" works fine.
>>  
>> Used versions: 
>> Ubuntu 20.04
>> slurm 20.02.4
>> OMPI 4.0.4
>> PMIx 3.1.5
>> UCX 1.9.0-rc1
>> OFED 4.9
>>  
>> With ucx 1.8.1 I got a slightly different error:
>> error: host1 [0] pmixp_dconn_ucx.c:245 [pmixp_dconn_ucx_prepare] mpi/pmix: ERROR: Fail to init UCX: Unsupported operation
>> [2020-08-11T20:24:48.117] [2.0] error: host1 [0] pmixp_dconn.c:72 [pmixp_dconn_init] mpi/pmix: ERROR: Cannot get polling fd
>> [2020-08-11T20:24:48.117] [2.0] error: host1 [0] pmixp_server.c:402 [pmixp_stepd_init] mpi/pmix: ERROR: pmixp_dconn_init() failed
>> [2020-08-11T20:24:48.117] [2.0] error: (null) [0] mpi_pmix.c:161 [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed
>> [2020-08-11T20:24:48.119] [2.0] error: Failed mpi_hook_slurmstepd_prefork
>> [2020-08-11T20:24:48.121] [2.0] error: job_manager exiting abnormally, rc = -1
>>  
>> Did you solve the problem?
>>  
>>  
>> Greetings,
>> Max