[slurm-users] [External] Re: openmpi / UCX / srun
Max Quast
max at quast.de
Sun Aug 16 14:00:11 UTC 2020
hi stijn,
> i would look into using mellanox ofed with rdma-core
I will try the new 5.1 Mellanox OFED in the near future.
> you let pmix do its job and thus simply start the mpi parts with srun
> instead of mpirun
In this case 'srun -N2 -n2 --mpi=pmix pingpong 100 100' works fine, but the IB
connection is not used for the communication; only the TCP connection is used.
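If I understand it correctly, the same UCX settings that are passed to mpirun
can also be exported in the environment before srun (just a sketch on my side;
the device name mlx5_0:1 is taken from the mpirun test quoted below, and I
added self/sm for intra-node traffic):

  export OMPI_MCA_pml=ucx          # same as '--mca pml ucx' on the mpirun line
  export UCX_TLS=rc,self,sm        # RC over IB plus shared memory and self
  export UCX_NET_DEVICES=mlx5_0:1  # restrict UCX to the IB port
  srun -N2 -n2 --mpi=pmix pingpong 100 100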
The output of 'pmix_info' is:
MCA bfrops: v21 (MCA v2.1.0, API v1.0.0, Component v3.1.5)
MCA bfrops: v12 (MCA v2.1.0, API v1.0.0, Component v3.1.5)
MCA bfrops: v20 (MCA v2.1.0, API v1.0.0, Component v3.1.5)
MCA bfrops: v3 (MCA v2.1.0, API v1.0.0, Component v3.1.5)
MCA gds: ds12 (MCA v2.1.0, API v1.0.0, Component v3.1.5)
MCA gds: hash (MCA v2.1.0, API v1.0.0, Component v3.1.5)
MCA gds: ds21 (MCA v2.1.0, API v1.0.0, Component v3.1.5)
MCA pdl: pdlopen (MCA v2.1.0, API v1.0.0, Component v3.1.5)
MCA pif: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component v3.1.5)
MCA pif: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component v3.1.5)
MCA pinstalldirs: env (MCA v2.1.0, API v1.0.0, Component v3.1.5)
MCA pinstalldirs: config (MCA v2.1.0, API v1.0.0, Component v3.1.5)
MCA plog: stdfd (MCA v2.1.0, API v1.0.0, Component v3.1.5)
MCA plog: syslog (MCA v2.1.0, API v1.0.0, Component v3.1.5)
MCA plog: default (MCA v2.1.0, API v1.0.0, Component v3.1.5)
MCA pnet: tcp (MCA v2.1.0, API v1.0.0, Component v3.1.5)
MCA pnet: test (MCA v2.1.0, API v1.0.0, Component v3.1.5)
MCA preg: native (MCA v2.1.0, API v1.0.0, Component v3.1.5)
MCA psec: native (MCA v2.1.0, API v1.0.0, Component v3.1.5)
MCA psec: none (MCA v2.1.0, API v1.0.0, Component v3.1.5)
MCA psensor: file (MCA v2.1.0, API v1.0.0, Component v3.1.5)
MCA psensor: heartbeat (MCA v2.1.0, API v1.0.0, Component v3.1.5)
MCA pshmem: mmap (MCA v2.1.0, API v1.0.0, Component v3.1.5)
MCA ptl: usock (MCA v2.1.0, API v1.0.0, Component v3.1.5)
MCA ptl: tcp (MCA v2.1.0, API v1.0.0, Component v3.1.5)
Shouldn't something with ucx appear in there?
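Maybe pmix_info only lists PMIx's own components, and the UCX support lives in
Open MPI and the UCX library themselves? I would check it roughly like this
(just a sketch):

  ompi_info | grep -i ucx   # is the ucx pml built into this Open MPI?
  ucx_info -d               # which transports and devices does UCX itself see?
  srun --mpi=list           # which MPI plugin types does this Slurm offer?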
Thanks :)
max
> hi max,
>
> > I have set: 'UCX_TLS=tcp,self,sm' on the slurmd's.
> > Is it better to build slurm without UCX support or should I simply
> > install rdma-core?
> i would look into using mellanox ofed with rdma-core, as it is what mellanox
> is shifting towards or has already done (not sure what 4.9 has tbh). or leave
> the env vars, i think for pmix it's ok unless you have very large clusters
> (but i'm no expert here).
>
> >
> > How do I use ucx together with OpenMPI and srun now?
> > It works when I set this manually:
> > 'mpirun -np 2 -H lsm218,lsm219 --mca pml ucx -x UCX_TLS=rc -x
> > UCX_NET_DEVICES=mlx5_0:1 pingpong 1000 1000'.
> > But if I put srun before mpirun four tasks will be created, two on
> > each node.
> you let pmix do its job and thus simply start the mpi parts with srun
> instead of mpirun
>
> srun pingpong 1000 1000
>
> if you must tune UCX (as in: default behaviour is not ok), also set it via
> env vars. (at least try to use the defaults, it's pretty good i think)
>
> (shameless plug: one of my colleagues set up a tech talk with openmpi people
> wrt pmix, ucx, openmpi etc; see
> https://github.com/easybuilders/easybuild/issues/630 for details and a link
> to the youtube recording)
>
> stijn
>