[slurm-users] [External] Re: openmpi / UCX / srun

Stijn De Weirdt stijn.deweirdt at ugent.be
Fri Aug 14 12:27:22 UTC 2020


hi max,

> I have set 'UCX_TLS=tcp,self,sm' on the slurmd daemons.
> Is it better to build slurm without UCX support or should I simply install
> rdma-core?
i would look into using mellanox ofed with rdma-core, as that is what
mellanox is shifting towards, or has already shifted to (i'm not sure
what OFED 4.9 ships, to be honest). alternatively, leave the env vars as
they are; i think that's fine for pmix unless you have a very large
cluster (but i'm no expert here).
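
if you go the env var route, one way to make the setting stick is a
systemd drop-in for slurmd (a minimal sketch; the drop-in path assumes a
standard slurmd.service unit, adjust to your install):

    # /etc/systemd/system/slurmd.service.d/ucx.conf  (hypothetical path)
    [Service]
    Environment="UCX_TLS=tcp,self,sm"

after that, 'systemctl daemon-reload && systemctl restart slurmd'. note
this only constrains the ucx instance inside slurmd (used by pmix), not
what your application's ucx will pick.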

> 
> How do I use ucx together with OpenMPI and srun now? 
> It works when I set this manually:
> 'mpirun -np 2 -H lsm218,lsm219 --mca pml ucx -x UCX_TLS=rc -x
> UCX_NET_DEVICES=mlx5_0:1 pingpong 1000 1000'.
> But if I put srun before mpirun, four tasks will be created, two on
> each node.
let pmix do its job and simply start the mpi program with srun instead
of mpirun:

srun pingpong 1000 1000
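
in a batch script that would look something like this (a minimal sketch;
the --mpi=pmix flag is only needed if pmix is not your MpiDefault in
slurm.conf, and the node/task counts just mirror your two-node test):

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=1

    # slurm/pmix wires up the ranks; no mpirun involved
    srun --mpi=pmix pingpong 1000 1000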

if you must tune UCX (as in: the default behaviour is not ok for you),
also set it via env vars. (at least try the defaults first; they are
pretty good, i think)
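
the srun equivalent of the -x flags from your mpirun line would be
something like this (a sketch; the device name mlx5_0:1 is taken from
your own command and may differ per node):

    export UCX_TLS=rc
    export UCX_NET_DEVICES=mlx5_0:1
    srun pingpong 1000 1000

to see what ucx actually picked at runtime, run with UCX_LOG_LEVEL=info;
'ucx_info -d' lists the devices and transports available on a node.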

(shameless plug: one of my colleagues set up a tech talk with openmpi
people about pmix, ucx, openmpi etc.; see
https://github.com/easybuilders/easybuild/issues/630 for details and a
link to the youtube recording)

stijn

> 
> Thanks for helping me!
> -max
> 
> -----Original Message-----
> From: Stijn De Weirdt <stijn.deweirdt at ugent.be>
> Sent: Wednesday, 12 August 2020 22:30
> To: slurm-users at lists.schedmd.com
> Subject: Re: [slurm-users] [External] Re: openmpi / UCX / srun
> 
> hi max,
> 
> are you using rdma-core with mellanox ofed? and do you have any
> uverbs_write error messages in dmesg on the hosts? there is an issue
> with rdma vs tcp in ucx+pmix when rdma-core is not used. the workaround
> for the issue is to start slurmd on the nodes with the environment
> 'UCX_TLS=tcp,self,sm' (and not set UCX_TLS in the application
> environment), so the ucx used by pmix does not do rdma, which is
> ok-ish; the app itself will use the default ucx, which will pick rdma
> instead of tcp.
> 
> stijn
> 
> On 8/12/20 9:25 PM, Max Quast wrote:
>> Hello Prentice,
>>
>> sorry for that.
>>
>> My post refers to a post by Dean Hidas on Mon Jun 17 17:40:56 UTC 2019:
>>
>>> Hello,
>>>
>>> I am trying to use ucx with slurm/pmix and run into the error below.
>>> The following works using mpirun, but what I was hoping was the srun
>>> equivalent fails. Is there some flag or configuration I might be
>>> missing for slurm?
>>>
>>> Works fine:
>>> mpirun -n 100 --host apcpu-004:88,apcpu-005:88 --mca pml ucx --mca osc ucx ./hello
>>>
>>> does not work:
>>> srun -n 100 ./hello
>>> slurmstepd: error: apcpu-004 [0] pmixp_dconn_ucx.c:668 [_ucx_connect] mpi/pmix: ERROR: ucp_ep_create failed: Input/output error
>>> slurmstepd: error: apcpu-004 [0] pmixp_dconn.h:243 [pmixp_dconn_connect] mpi/pmix: ERROR: Cannot establish direct connection to apcpu-005 (1)
>>> slurmstepd: error: apcpu-004 [0] pmixp_server.c:731 [_process_extended_hdr] mpi/pmix: ERROR: Unable to connect to 1
>>> slurmstepd: error: *** STEP 50.0 ON apcpu-004 CANCELLED AT 2019-06-17T13:30:11 ***
>>>
>>> The configurations for pmix, openmpi, slurm, ucx are the following (on Debian 8):
>>>
>>> pmix 3.1.2
>>> ./configure --prefix=/opt/apps/gcc-7_4/pmix/3.1.2
>>>
>>> openmpi 4.0.1
>>> ./configure --prefix=/opt/apps/gcc-7_4/openmpi/4.0.1 --with-pmix=/opt/apps/gcc-7_4/pmix/3.1.2 --with-libfabric=/opt/apps/gcc-7_4/libfabric/1.7.2 --with-ucx=/opt/apps/gcc-7_4/ucx/1.5.1 --with-libevent=external --disable-dlopen --without-verbs
>>>
>>> slurm 19.05.0
>>> ./configure --enable-debug --enable-x11 --with-pmix=/opt/apps/gcc-7_4/pmix/3.1.2 --sysconfdir=/etc/slurm --prefix=/opt/apps/slurm/19.05.0 --with-ucx=/opt/apps/gcc-7_4/ucx/1.5.1
>>>
>>> ucx 1.5.1
>>> ./configure --enable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/opt/apps/gcc-7_4/ucx/1.5.1
>>>
>>> Any advice is much appreciated.
>>>
>>> Best,
>>> -Dean
>>
>>>> Max,
>>>>
>>>> You didn't quote the original e-mail so I'm not sure what the original problem was, or who "you" is.
>>>>
>>>> --
>>>> Prentice
>>>>
>>>> On 8/12/20 6:55 AM, Max Quast wrote:
>>>> I am also trying to use ucx with slurm/PMIx and get the same error.
>>>> Also mpirun with "--mca pml ucx" works fine.
>>>>
>>>> Used versions:
>>>> Ubuntu 20.04
>>>> slurm 20.02.4
>>>> OMPI 4.0.4
>>>> PMIx 3.1.5
>>>> UCX 1.9.0-rc1
>>>> OFED 4.9
>>>>
>>>> With ucx 1.8.1 I got a slightly different error:
>>>> error: host1 [0] pmixp_dconn_ucx.c:245 [pmixp_dconn_ucx_prepare] mpi/pmix: ERROR: Fail to init UCX: Unsupported operation
>>>> [2020-08-11T20:24:48.117] [2.0] error: host1 [0] pmixp_dconn.c:72 [pmixp_dconn_init] mpi/pmix: ERROR: Cannot get polling fd
>>>> [2020-08-11T20:24:48.117] [2.0] error: host1 [0] pmixp_server.c:402 [pmixp_stepd_init] mpi/pmix: ERROR: pmixp_dconn_init() failed
>>>> [2020-08-11T20:24:48.117] [2.0] error: (null) [0] mpi_pmix.c:161 [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed
>>>> [2020-08-11T20:24:48.119] [2.0] error: Failed mpi_hook_slurmstepd_prefork
>>>> [2020-08-11T20:24:48.121] [2.0] error: job_manager exiting abnormally, rc = -1
>>>>
>>>> Did you solve the problem?
>>>>
>>>> Greetings,
>>>> Max
>>>>
>>>> --
>>>> Prentice


