[slurm-users] [External] Re: openmpi / UCX / srun
Stijn De Weirdt
stijn.deweirdt at ugent.be
Sun Aug 16 19:59:15 UTC 2020
hi max,
>> you let pmix do its job and thus simply start the mpi parts with srun
>> instead of mpirun
>
> In this case, the srun command works fine for 'srun -N2 -n2 --mpi=pmix
> pingpong 100 100', but the IB connection is not used for the
> communication, only the tcp connection.
hmmmm, this is odd. you should check if ucx is used at all (there are env
variables to make it verbose; it should also spit out what connection
options it uses).
how did you set the `UCX_TLS` var? are you sure it's not in the job env?
(do an "env | grep UCX" before the srun)
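e.g. something along these lines (just a sketch; UCX_LOG_LEVEL and ucx_info
are standard UCX knobs/tools, 'pingpong' is your own binary, and the exact
output format differs per ucx version):

    # check nothing UCX-related leaks into the job environment
    env | grep UCX

    # turn up ucx logging so it reports which transports/devices it selects
    UCX_LOG_LEVEL=info srun -N2 -n2 --mpi=pmix pingpong 100 100

    # on a compute node, list the transports/devices ucx can actually use
    ucx_info -d | grep -i transport
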
> The output of 'pmix_info' is:
>
> MCA bfrops: v21 (MCA v2.1.0, API v1.0.0, Component v3.1.5)
> MCA bfrops: v12 (MCA v2.1.0, API v1.0.0, Component v3.1.5)
> MCA bfrops: v20 (MCA v2.1.0, API v1.0.0, Component v3.1.5)
> MCA bfrops: v3 (MCA v2.1.0, API v1.0.0, Component v3.1.5)
> MCA gds: ds12 (MCA v2.1.0, API v1.0.0, Component v3.1.5)
> MCA gds: hash (MCA v2.1.0, API v1.0.0, Component v3.1.5)
> MCA gds: ds21 (MCA v2.1.0, API v1.0.0, Component v3.1.5)
> MCA pdl: pdlopen (MCA v2.1.0, API v1.0.0, Component v3.1.5)
> MCA pif: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component v3.1.5)
> MCA pif: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component v3.1.5)
> MCA pinstalldirs: env (MCA v2.1.0, API v1.0.0, Component v3.1.5)
> MCA pinstalldirs: config (MCA v2.1.0, API v1.0.0, Component v3.1.5)
> MCA plog: stdfd (MCA v2.1.0, API v1.0.0, Component v3.1.5)
> MCA plog: syslog (MCA v2.1.0, API v1.0.0, Component v3.1.5)
> MCA plog: default (MCA v2.1.0, API v1.0.0, Component v3.1.5)
> MCA pnet: tcp (MCA v2.1.0, API v1.0.0, Component v3.1.5)
> MCA pnet: test (MCA v2.1.0, API v1.0.0, Component v3.1.5)
> MCA preg: native (MCA v2.1.0, API v1.0.0, Component v3.1.5)
> MCA psec: native (MCA v2.1.0, API v1.0.0, Component v3.1.5)
> MCA psec: none (MCA v2.1.0, API v1.0.0, Component v3.1.5)
> MCA psensor: file (MCA v2.1.0, API v1.0.0, Component v3.1.5)
> MCA psensor: heartbeat (MCA v2.1.0, API v1.0.0, Component v3.1.5)
> MCA pshmem: mmap (MCA v2.1.0, API v1.0.0, Component v3.1.5)
> MCA ptl: usock (MCA v2.1.0, API v1.0.0, Component v3.1.5)
> MCA ptl: tcp (MCA v2.1.0, API v1.0.0, Component v3.1.5)
>
> Isn't there supposed to appear something with ucx?
dunno, i would have to check our setup again.
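one thing you could check on the openmpi side is whether the ucx pml is
built in at all, and which mpi plugin types your slurm knows about; e.g.
(plain ompi_info/srun, nothing exotic, output will differ per install):

    ompi_info | grep -i ucx
    srun --mpi=list
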
stijn
>
> Thanks :)
>
> max
>
>> hi max,
>>
>> > I have set: 'UCX_TLS=tcp,self,sm' on the slurmd's.
>> > Is it better to build slurm without UCX support or should I simply
>> > install rdma-core?
>>
>> i would look into using mellanox ofed with rdma-core, as it is what
>> mellanox is shifting towards or has already done (not sure what 4.9 has
>> tbh). or leave the env vars, i think for pmix it's ok unless you have
>> very large clusters (but i'm no expert here).
>>
>> > How do I use ucx together with OpenMPI and srun now?
>> > It works when I set this manually:
>> > 'mpirun -np 2 -H lsm218,lsm219 --mca pml ucx -x UCX_TLS=rc -x
>> > UCX_NET_DEVICES=mlx5_0:1 pingpong 1000 1000'.
>> > But if I put srun before mpirun four tasks will be created, two on
>> > each node.
>>
>> you let pmix do its job and thus simply start the mpi parts with srun
>> instead of mpirun
>>
>> srun pingpong 1000 1000
>>
>> if you must tune UCX (as in: default behaviour is not ok), also set it
>> via env vars. (at least try to use the defaults, it's pretty good i think)
>>
>> (shameless plug: one of my colleagues set up a tech talk with openmpi
>> people wrt pmix, ucx, openmpi etc; see
>> https://github.com/easybuilders/easybuild/issues/630 for details and a
>> link to the youtube recording)
>>
>> stijn
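ps: if you do end up tuning ucx by hand with srun, the env-var route would
look something like this (values copied from your mpirun line, so rc over
mlx5_0:1; adjust for your fabric, this is only a sketch):

    export UCX_TLS=rc
    export UCX_NET_DEVICES=mlx5_0:1
    srun -N2 -n2 --mpi=pmix pingpong 1000 1000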