[slurm-users] [External] Re: openmpi / UCX / srun

Max Quast max at quast.de
Thu Aug 13 22:08:01 UTC 2020


Hey Stijn, 

thank you very much for the advice!

Answers to your questions:
Q: are you using rdma-core with mellanox ofed?
A: only mellanox ofed, no rdma-core

Q: and do you have any uverbs_write error messages in dmesg on the hosts?
A: Yes, I do!
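(For the record, I looked for them like this on the compute nodes; just a
grep sketch, the exact message text may vary:

    dmesg | grep -i uverbs
)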

I have set 'UCX_TLS=tcp,self,sm' for the slurmd daemons.
Is it better to build slurm without UCX support, or should I simply install
rdma-core?
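In case it is useful to others: I put the variable into slurmd's environment
roughly like this (a sketch, assuming slurmd runs as a systemd service; the
drop-in file name 'ucx.conf' is just my choice):

    # /etc/systemd/system/slurmd.service.d/ucx.conf
    [Service]
    Environment="UCX_TLS=tcp,self,sm"

    # then reload units and restart slurmd
    systemctl daemon-reload
    systemctl restart slurmd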

How do I use UCX together with OpenMPI and srun now?
It works when I set everything manually:
'mpirun -np 2 -H lsm218,lsm219 --mca pml ucx -x UCX_TLS=rc -x UCX_NET_DEVICES=mlx5_0:1 pingpong 1000 1000'
But if I put srun before mpirun, four tasks are created, two on each node
(presumably each srun task starts its own mpirun, which then spawns two
ranks).
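What I am trying to get working is the direct launch, roughly like this (a
sketch, assuming slurm and OpenMPI are built against the same PMIx;
OMPI_MCA_pml is just the environment-variable form of '--mca pml'):

    export OMPI_MCA_pml=ucx
    export UCX_TLS=rc
    export UCX_NET_DEVICES=mlx5_0:1
    srun --mpi=pmix -N 2 --ntasks-per-node=1 pingpong 1000 1000

Since srun exports the environment to the tasks, the UCX variables should
reach the application without the '-x' options.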

Thanks for helping me!
-max

-----Original Message-----
From: Stijn De Weirdt <stijn.deweirdt at ugent.be> 
Sent: Wednesday, August 12, 2020 22:30
To: slurm-users at lists.schedmd.com
Subject: Re: [slurm-users] [External] Re: openmpi / UCX / srun

hi max,

are you using rdma-core with mellanox ofed? and do you have any uverbs_write
error messages in dmesg on the hosts? there is an issue with rdma vs tcp in
ucx+pmix when rdma-core is not used. the workaround is to start slurmd on the
nodes with 'UCX_TLS=tcp,self,sm' in its environment, and to leave UCX_TLS
unset in the application environment. that way the ucx used by pmix does not
do rdma, which is ok-ish; the app itself will use the default ucx, which will
pick rdma instead of tcp.
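
to see what ucx itself has available on a node, something like this helps
(just a pointer, the output format differs between ucx versions):

    ucx_info -d    # lists available devices and transports
    ucx_info -c    # prints the effective ucx configuration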

stijn

On 8/12/20 9:25 PM, Max Quast wrote:
> Hello Prentice,
> 
> sorry for that.
> 
> My post refers to a post by Dean Hidas on Mon Jun 17 17:40:56 UTC 2019:
> 
>> Hello,
>>
>> I am trying to use ucx with slurm/pmix and run into the error below.
>> The following works using mpirun, but what I was hoping was the srun
>> equivalent fails. Is there some flag or configuration I might be
>> missing for slurm?
>>
>> Works fine:
>> mpirun -n 100 --host apcpu-004:88,apcpu-005:88 --mca pml ucx --mca osc ucx ./hello
>>
>> does not work:
>> srun -n 100 ./hello
>>
>> slurmstepd: error: apcpu-004 [0] pmixp_dconn_ucx.c:668 [_ucx_connect] mpi/pmix: ERROR: ucp_ep_create failed: Input/output error
>> slurmstepd: error: apcpu-004 [0] pmixp_dconn.h:243 [pmixp_dconn_connect] mpi/pmix: ERROR: Cannot establish direct connection to apcpu-005 (1)
>> slurmstepd: error: apcpu-004 [0] pmixp_server.c:731 [_process_extended_hdr] mpi/pmix: ERROR: Unable to connect to 1
>> slurmstepd: error: *** STEP 50.0 ON apcpu-004 CANCELLED AT 2019-06-17T13:30:11 ***
>>
>> The configurations for pmix, openmpi, slurm, ucx are the following (on Debian 8):
>>
>> pmix 3.1.2
>> ./configure --prefix=/opt/apps/gcc-7_4/pmix/3.1.2
>>
>> openmpi 4.0.1
>> ./configure --prefix=/opt/apps/gcc-7_4/openmpi/4.0.1 --with-pmix=/opt/apps/gcc-7_4/pmix/3.1.2 --with-libfabric=/opt/apps/gcc-7_4/libfabric/1.7.2 --with-ucx=/opt/apps/gcc-7_4/ucx/1.5.1 --with-libevent=external --disable-dlopen --without-verbs
>>
>> slurm 19.05.0
>> ./configure --enable-debug --enable-x11 --with-pmix=/opt/apps/gcc-7_4/pmix/3.1.2 --sysconfdir=/etc/slurm --prefix=/opt/apps/slurm/19.05.0 --with-ucx=/opt/apps/gcc-7_4/ucx/1.5.1
>>
>> ucx 1.5.1
>> ./configure --enable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/opt/apps/gcc-7_4/ucx/1.5.1
>>
>> Any advice is much appreciated.
>>
>> Best,
>> -Dean
> 
>> Max,
>>
>> You didn't quote the original e-mail so I'm not sure what the original problem was, or who "you" is.
>>
>> --
>> Prentice
>>
>> On 8/12/20 6:55 AM, Max Quast wrote:
>>> I am also trying to use ucx with slurm/PMIx and get the same error.
>>> Also mpirun with "--mca pml ucx" works fine.
>>>
>>> Used versions:
>>> Ubuntu 20.04
>>> slurm 20.02.4
>>> OMPI 4.0.4
>>> PMIx 3.1.5
>>> UCX 1.9.0-rc1
>>> OFED 4.9
>>>
>>> With ucx 1.8.1 I got a slightly different error:
>>> error: host1 [0] pmixp_dconn_ucx.c:245 [pmixp_dconn_ucx_prepare] mpi/pmix: ERROR: Fail to init UCX: Unsupported operation
>>> [2020-08-11T20:24:48.117] [2.0] error: host1 [0] pmixp_dconn.c:72 [pmixp_dconn_init] mpi/pmix: ERROR: Cannot get polling fd
>>> [2020-08-11T20:24:48.117] [2.0] error: host1 [0] pmixp_server.c:402 [pmixp_stepd_init] mpi/pmix: ERROR: pmixp_dconn_init() failed
>>> [2020-08-11T20:24:48.117] [2.0] error: (null) [0] mpi_pmix.c:161 [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed
>>> [2020-08-11T20:24:48.119] [2.0] error: Failed mpi_hook_slurmstepd_prefork
>>> [2020-08-11T20:24:48.121] [2.0] error: job_manager exiting abnormally, rc = -1
>>>
>>> Did you solve the problem?
>>>
>>> Greetings,
>>> Max