[slurm-users] [External] Re: openmpi / UCX / srun

Max Quast max at quast.de
Sun Aug 16 14:00:11 UTC 2020


hi stijn,

> i would look into using mellanox ofed with rdma-core

I will try the new Mellanox OFED 5.1 in the near future.

> you let pmix do its job and thus simply start the mpi parts with srun
> instead of mpirun

In this case, 'srun -N2 -n2 --mpi=pmix pingpong 100 100' runs fine, but the
communication goes over the TCP connection only; the IB connection is not
used.
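
For comparison, the options from the working mpirun call (quoted further
down) can also be handed to srun through the environment: Open MPI reads MCA
parameters from OMPI_MCA_<name> variables, and UCX reads its UCX_* settings
from the environment. The following is only a sketch under those assumptions,
reusing the transport and device name from the mpirun example; it has not
been verified on this cluster, and the fallback echo is purely illustrative.

```shell
# Same settings as the working mpirun call, expressed as environment variables.
export OMPI_MCA_pml=ucx          # equivalent of 'mpirun --mca pml ucx'
export UCX_TLS=rc                # equivalent of 'mpirun -x UCX_TLS=rc'
export UCX_NET_DEVICES=mlx5_0:1  # equivalent of 'mpirun -x UCX_NET_DEVICES=mlx5_0:1'

# srun passes the caller's environment on to the tasks by default.
if command -v srun >/dev/null 2>&1; then
    srun -N2 -n2 --mpi=pmix pingpong 100 100
else
    # Illustrative fallback for machines without Slurm installed.
    echo "srun not found; would run: srun -N2 -n2 --mpi=pmix pingpong 100 100"
fi
```

If the defaults turn out to pick the IB device on their own, the two UCX_*
lines can simply be dropped again.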

 

The output of 'pmix_info' is:

    MCA bfrops: v21 (MCA v2.1.0, API v1.0.0, Component v3.1.5)
    MCA bfrops: v12 (MCA v2.1.0, API v1.0.0, Component v3.1.5)
    MCA bfrops: v20 (MCA v2.1.0, API v1.0.0, Component v3.1.5)
    MCA bfrops: v3 (MCA v2.1.0, API v1.0.0, Component v3.1.5)
    MCA gds: ds12 (MCA v2.1.0, API v1.0.0, Component v3.1.5)
    MCA gds: hash (MCA v2.1.0, API v1.0.0, Component v3.1.5)
    MCA gds: ds21 (MCA v2.1.0, API v1.0.0, Component v3.1.5)
    MCA pdl: pdlopen (MCA v2.1.0, API v1.0.0, Component v3.1.5)
    MCA pif: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component v3.1.5)
    MCA pif: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component v3.1.5)
    MCA pinstalldirs: env (MCA v2.1.0, API v1.0.0, Component v3.1.5)
    MCA pinstalldirs: config (MCA v2.1.0, API v1.0.0, Component v3.1.5)
    MCA plog: stdfd (MCA v2.1.0, API v1.0.0, Component v3.1.5)
    MCA plog: syslog (MCA v2.1.0, API v1.0.0, Component v3.1.5)
    MCA plog: default (MCA v2.1.0, API v1.0.0, Component v3.1.5)
    MCA pnet: tcp (MCA v2.1.0, API v1.0.0, Component v3.1.5)
    MCA pnet: test (MCA v2.1.0, API v1.0.0, Component v3.1.5)
    MCA preg: native (MCA v2.1.0, API v1.0.0, Component v3.1.5)
    MCA psec: native (MCA v2.1.0, API v1.0.0, Component v3.1.5)
    MCA psec: none (MCA v2.1.0, API v1.0.0, Component v3.1.5)
    MCA psensor: file (MCA v2.1.0, API v1.0.0, Component v3.1.5)
    MCA psensor: heartbeat (MCA v2.1.0, API v1.0.0, Component v3.1.5)
    MCA pshmem: mmap (MCA v2.1.0, API v1.0.0, Component v3.1.5)
    MCA ptl: usock (MCA v2.1.0, API v1.0.0, Component v3.1.5)
    MCA ptl: tcp (MCA v2.1.0, API v1.0.0, Component v3.1.5)

 

Shouldn't something related to UCX appear in this list?
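
For context: pmix_info lists the components of the PMIx library itself,
whereas Open MPI's UCX support, if it was built in, is reported by ompi_info
as a pml component named ucx. A hedged sketch of a few checks that could
narrow this down, assuming ompi_info, ucx_info and srun are on the PATH (the
helper name 'have' is made up for this example):

```shell
# Hypothetical helper: succeeds if the given command is available on PATH.
have() { command -v "$1" >/dev/null 2>&1; }

# Does this Open MPI build include any UCX components (e.g. the ucx pml)?
if have ompi_info; then ompi_info | grep -i ucx; else echo "ompi_info not found"; fi

# Which devices and transports does UCX itself detect on this node?
if have ucx_info; then ucx_info -d | grep -E 'Transport|Device'; else echo "ucx_info not found"; fi

# Which MPI plugin types does this Slurm installation offer?
if have srun; then srun --mpi=list; else echo "srun not found"; fi
```

Running the ucx_info check on a compute node (for example through srun)
rather than on the login node shows what the job itself would see.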

 

Thanks :)

max

 

> hi max,
>
> > I have set: 'UCX_TLS=tcp,self,sm' on the slurmd's.
> > Is it better to build slurm without UCX support or should I simply
> > install rdma-core?
>
> i would look into using mellanox ofed with rdma-core, as it is what
> mellanox is shifting towards or has already done (not sure what 4.9 has
> tbh). or leave the env vars, i think for pmix it's ok unless you have very
> large clusters (but i'm no expert here).
>
> > How do I use ucx together with OpenMPI and srun now?
> > It works when I set this manually:
> > 'mpirun -np 2 -H lsm218,lsm219 --mca pml ucx -x UCX_TLS=rc -x UCX_NET_DEVICES=mlx5_0:1 pingpong 1000 1000'.
> > But if I put srun before mpirun four tasks will be created, two on
> > each node.
>
> you let pmix do its job and thus simply start the mpi parts with srun
> instead of mpirun
>
> srun pingpong 1000 1000
>
> if you must tune UCX (as in: default behaviour is not ok), also set it via
> env vars. (at least try to use the defaults, it's pretty good i think)
>
> (shameless plug: one of my colleagues set up a tech talk with openmpi
> people wrt pmix, ucx, openmpi etc; see
> https://github.com/easybuilders/easybuild/issues/630 for details and link
> to youtube recording)
>
> stijn
