[slurm-users] [pmix] [Cross post - Slurm, PMIx, UCX] Using srun with SLURM_PMIX_DIRECT_CONN_UCX=true fails with input/output error

Fulcomer, Samuel samuel_fulcomer at brown.edu
Thu Oct 22 18:25:29 UTC 2020


Compile Slurm without UCX support. We wound up spending quality time with
the Mellanox... wait, no, NVIDIA Networking UCX folks to get this sorted
out.

I recommend using Slurm 20 rather than 19.
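
For anyone finding this in the archives, here is a rough sketch of both
workarounds, based on the rpmbuild line quoted further down. The tarball
name is a placeholder and the spec options may differ between Slurm
releases, so treat this as illustrative rather than exact:

# Rebuild the Slurm RPMs with "--with ucx" dropped, so the PMIx plugin is
# built without UCX direct-connect support (assuming the spec leaves UCX
# disabled by default when the option is omitted):
rpmbuild -ta slurm-20.02.x.tar.bz2 --without debug \
    --define '_with_pmix --with-pmix=/usr'

# Or, as a runtime-only workaround (shown to work further down in this
# thread), keep the UCX-enabled build but disable the UCX direct connection:
export SLURM_PMIX_DIRECT_CONN_UCX=false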

regards,
s



On Thu, Oct 22, 2020 at 10:23 AM Michael Di Domenico <mdidomenico4 at gmail.com>
wrote:

> Was there ever a resolution to this?  I'm seeing the same error message,
> but I'm not adding all the environment flags like the original poster.
>
> On Wed, Jul 10, 2019 at 9:18 AM Daniel Letai <dani at letai.org.il> wrote:
> >
> > Thank you Artem,
> >
> >
> > I made a mistake while typing the mail: in every case it was
> 'OMPI_MCA_pml=ucx', not 'OMPI_MCA_pml=true' as written below. When I went
> over the mail before sending, I must have erroneously 'fixed' it.
> >
> >
> > ----
> >
> > Best regards,
> >
> > --Dani_L.
> >
> >
> > On 7/9/19 9:06 PM, Artem Polyakov wrote:
> >
> > Hello, Daniel
> >
> > Let me try to reproduce locally and get back to you.
> >
> > ----
> > Best regards,
> > Artem Y. Polyakov, PhD
> > Senior Architect, SW
> > Mellanox Technologies
> > ________________________________
> > From: pmix at googlegroups.com <pmix at googlegroups.com> on behalf of Daniel
> Letai <dani at letai.org.il>
> > Sent: Tuesday, July 9, 2019 3:25:22 AM
> > To: Slurm User Community List; pmix at googlegroups.com;
> ucx-group at elist.ornl.gov
> > Subject: [pmix] [Cross post - Slurm, PMIx, UCX] Using srun with
> SLURM_PMIX_DIRECT_CONN_UCX=true fails with input/output error
> >
> >
> > Cross posting to Slurm, PMIx and UCX lists.
> >
> >
> > Trying to execute a simple Open MPI (4.0.1) mpi-hello-world via Slurm
> (19.05.0), compiled with both PMIx (3.1.2) and UCX (1.5.0) support, results in:
> >
> >
> > [root at n1 ~]# SLURM_PMIX_DIRECT_CONN_UCX=true
> SLURM_PMIX_DIRECT_CONN=true OMPI_MCA_pml=true
> OMPI_MCA_btl='^vader,tcp,openib' UCX_NET_DEVICES='mlx4_0:1'
> SLURM_PMIX_DIRECT_CONN_EARLY=false UCX_TLS=rc,shm srun --export
> SLURM_PMIX_DIRECT_CONN_UCX,SLURM_PMIX_DIRECT_CONN,OMPI_MCA_pml,OMPI_MCA_btl,
> UCX_NET_DEVICES,SLURM_PMIX_DIRECT_CONN_EARLY,UCX_TLS --mpi=pmix -N 2 -n 2
> /data/mpihello/mpihello
> >
> >
> > slurmstepd: error: n1 [0] pmixp_dconn_ucx.c:668 [_ucx_connect] mpi/pmix:
> ERROR: ucp_ep_create failed: Input/output error
> > slurmstepd: error: n1 [0] pmixp_dconn.h:243 [pmixp_dconn_connect]
> mpi/pmix: ERROR: Cannot establish direct connection to n2 (1)
> > slurmstepd: error: n1 [0] pmixp_server.c:731 [_process_extended_hdr]
> mpi/pmix: ERROR: Unable to connect to 1
> > srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> > slurmstepd: error: n2 [1] pmixp_dconn_ucx.c:668 [_ucx_connect] mpi/pmix:
> ERROR: ucp_ep_create failed: Input/output error
> > slurmstepd: error: n2 [1] pmixp_dconn.h:243 [pmixp_dconn_connect]
> mpi/pmix: ERROR: Cannot establish direct connection to n1 (0)
> > slurmstepd: error: *** STEP 7202.0 ON n1 CANCELLED AT
> 2019-07-01T13:20:36 ***
> > slurmstepd: error: n2 [1] pmixp_server.c:731 [_process_extended_hdr]
> mpi/pmix: ERROR: Unable to connect to 0
> > srun: error: n2: task 1: Killed
> > srun: error: n1: task 0: Killed
> >
> >
> > However, the following works:
> >
> >
> > [root at n1 ~]# SLURM_PMIX_DIRECT_CONN_UCX=false
> SLURM_PMIX_DIRECT_CONN=true OMPI_MCA_pml=true
> OMPI_MCA_btl='^vader,tcp,openib' UCX_NET_DEVICES='mlx4_0:1'
> SLURM_PMIX_DIRECT_CONN_EARLY=false UCX_TLS=rc,shm srun --export
> SLURM_PMIX_DIRECT_CONN_UCX,SLURM_PMIX_DIRECT_CONN,OMPI_MCA_pml,OMPI_MCA_btl,
> UCX_NET_DEVICES,SLURM_PMIX_DIRECT_CONN_EARLY,UCX_TLS --mpi=pmix -N 2 -n 2
> /data/mpihello/mpihello
> >
> >
> > n2: Process 1 out of 2
> > n1: Process 0 out of 2
> >
> >
> > [root at n1 ~]# SLURM_PMIX_DIRECT_CONN_UCX=false
> SLURM_PMIX_DIRECT_CONN=true OMPI_MCA_pml=true
> OMPI_MCA_btl='^vader,tcp,openib' UCX_NET_DEVICES='mlx4_0:1'
> SLURM_PMIX_DIRECT_CONN_EARLY=true UCX_TLS=rc,shm srun --export
> SLURM_PMIX_DIRECT_CONN_UCX,SLURM_PMIX_DIRECT_CONN,OMPI_MCA_pml,OMPI_MCA_btl,
> UCX_NET_DEVICES,SLURM_PMIX_DIRECT_CONN_EARLY,UCX_TLS --mpi=pmix -N 2 -n 2
> /data/mpihello/mpihello
> >
> >
> > n2: Process 1 out of 2
> > n1: Process 0 out of 2
> >
> >
> > Executing mpirun directly (same env vars, without the slurm vars) works,
> so UCX appears to function correctly.
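> >
> > A rough sketch of such a direct launch, for illustration only (the exact
> > mpirun flags are an approximation; hosts and rank count mirror the srun
> > examples above):
> >
> > # Let Open MPI's own launcher start the two ranks on n1 and n2, keeping
> > # the same UCX/OMPI environment but none of the SLURM_PMIX_* variables:
> > OMPI_MCA_pml=ucx OMPI_MCA_btl='^vader,tcp,openib' \
> >   UCX_NET_DEVICES='mlx4_0:1' UCX_TLS=rc,shm \
> >   mpirun -np 2 -H n1,n2 /data/mpihello/mpihello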
> >
> >
> > If both SLURM_PMIX_DIRECT_CONN_EARLY=true and
> SLURM_PMIX_DIRECT_CONN_UCX=true are set, I get collective timeout errors
> from mellanox/hcoll, and glibc reports '/data/mpihello/mpihello: malloc():
> memory corruption (fast)'.
> >
> >
> > Can anyone help with getting PMIx direct connections over UCX working in
> Slurm?
> >
> >
> >
> >
> > Some info about my setup:
> >
> >
> > UCX version
> >
> > [root at n1 ~]# ucx_info -v
> >
> > # UCT version=1.5.0 revision 02078b9
> > # configured with: --build=x86_64-redhat-linux-gnu
> --host=x86_64-redhat-linux-gnu --target=x86_64-redhat-linux-gnu
> --program-prefix= --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin
> --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share
> --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec
> --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man
> --infodir=/usr/share/info --disable-optimizations --disable-logging
> --disable-debug --disable-assertions --enable-mt --disable-params-check
> >
> >
> > Mellanox OFED version:
> >
> > [root at n1 ~]# ofed_info -s
> > OFED-internal-4.5-1.0.1:
> >
> >
> > Slurm:
> >
> > Slurm was built with:
> > rpmbuild -ta slurm-19.05.0.tar.bz2 --without debug --with ucx --define
> '_with_pmix --with-pmix=/usr'
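> >
> > To double-check whether a given build actually linked UCX into the PMIx
> > plugin, inspecting the plugin's shared-library dependencies should work;
> > the plugin path below is an assumption based on a standard RPM install
> > and may differ on other systems:
> >
> > # libucp should appear here if the plugin was built with "--with ucx":
> > ldd /usr/lib64/slurm/mpi_pmix.so | grep -i ucp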
> >
> >
> > PMIx:
> >
> > [root at n1 ~]# pmix_info -c --parsable
> > config:user:root
> > config:timestamp:"Mon Mar 25 09:51:04 IST 2019"
> > config:host:slurm-test
> > config:cli: '--host=x86_64-redhat-linux-gnu'
> '--build=x86_64-redhat-linux-gnu' '--program-prefix=' '--prefix=/usr'
> '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin'
> '--sysconfdir=/etc' '--datadir=/usr/share' '--includedir=/usr/include'
> '--libdir=/usr/lib64' '--libexecdir=/usr/libexec' '--localstatedir=/var'
> '--sharedstatedir=/var/lib' '--mandir=/usr/share/man'
> '--infodir=/usr/share/info'
> >
> >
> > Thanks,
> >
> > --Dani_L.
> >