[slurm-users] [External] Re: openmpi / UCX / srun

Stijn De Weirdt stijn.deweirdt at ugent.be
Sun Aug 16 19:59:15 UTC 2020


hi max,

>> you let pmix do its job and thus simply start the mpi parts with srun
>> instead of mpirun
> 
> In this case, the srun command works fine for 'srun -N2 -n2 --mpi=pmix
> pingpong 100 100', but the IB connection is not used for the
> communication, only the tcp connection.

hmmmm, this is odd. you should check if ucx is used at all (there are env
variables to make it verbose; it should also spit out what connection
options it uses).

how did you set the `UCX_TLS` var? are you sure it's not in the job env?
(do an "env | grep UCX" before the srun)


> The output of 'pmix_info' is:
> 
>                 MCA bfrops: v21 (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>                 MCA bfrops: v12 (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>                 MCA bfrops: v20 (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>                 MCA bfrops: v3 (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>                 MCA gds: ds12 (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>                 MCA gds: hash (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>                 MCA gds: ds21 (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>                 MCA pdl: pdlopen (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>                 MCA pif: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component v3.1.5)
>                 MCA pif: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component v3.1.5)
>                 MCA pinstalldirs: env (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>                 MCA pinstalldirs: config (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>                 MCA plog: stdfd (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>                 MCA plog: syslog (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>                 MCA plog: default (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>                 MCA pnet: tcp (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>                 MCA pnet: test (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>                 MCA preg: native (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>                 MCA psec: native (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>                 MCA psec: none (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>                 MCA psensor: file (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>                 MCA psensor: heartbeat (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>                 MCA pshmem: mmap (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>                 MCA ptl: usock (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>                 MCA ptl: tcp (MCA v2.1.0, API v1.0.0, Component v3.1.5)
> 
> Isn't there supposed to appear something with ucx?

dunno, i would have to check our setup again.
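
fwiw, i think the ucx part lives in openmpi's pml rather than in pmix
itself, so i wouldn't necessarily expect pmix_info to show it. a quick
check on the openmpi side, assuming ompi_info is in your path:

   # should list something like "MCA pml: ucx" if openmpi was built with ucx
   ompi_info | grep -i ucx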

stijn

> 
>  
> 
> Thanks :)
> 
> max
> 
>  
> 
>> hi max,
> 
>>
> 
>> > I have set: 'UCX_TLS=tcp,self,sm' on the slurmd's.
> 
>> > Is it better to build slurm without UCX support or should I simply
>> > install rdma-core?
> 
>> i would look into using mellanox ofed with rdma-core, as it is what
>> mellanox is shifting towards or has already done (not sure what 4.9 has
>> tbh). or leave the env vars, i think for pmix it's ok unless you have
>> very large clusters (but i'm no expert here).
> 
>>
> 
>> >
> 
>> > How do I use ucx together with OpenMPI and srun now?
> 
>> > It works when I set this manually:
> 
>> > 'mpirun -np 2 -H lsm218,lsm219 --mca pml ucx -x UCX_TLS=rc -x
>> > UCX_NET_DEVICES=mlx5_0:1 pingpong 1000 1000'.
> 
>> > But if I put srun before mpirun four tasks will be created, two on
>> > each node.
> 
>> you let pmix do its job and thus simply start the mpi parts with srun
>> instead of mpirun
> 
>>
> 
>> srun pingpong 1000 1000
> 
>>
> 
>> if you must tune UCX (as in: default behaviour is not ok), also set it
>> via env vars. (at least try to use the defaults, it's pretty good i think)
> 
>>
> 
>> (shameless plug: one of my colleagues set up a tech talk with openmpi
>> people wrt pmix, ucx, openmpi etc; see
>> https://github.com/easybuilders/easybuild/issues/630 for details and
>> link to youtube recording)
> 
>>
> 
>> stijn
> 
>> 
> 


