[slurm-users] HELP: error between compilation and execution on gpu cluster
Saksham Pande 5-Year IDD Physics
saksham.pande.phy20 at itbhu.ac.in
Fri May 19 11:02:50 UTC 2023
Thank you for responding.
The part of the ompi_info output that describes the build configuration is:
Configure command line: '--build=x86_64-redhat-linux-gnu'
'--host=x86_64-redhat-linux-gnu'
'--program-prefix='
'--disable-dependency-tracking'
'--prefix=/usr/mpi/gcc/openmpi-4.0.2a1'
'--exec-prefix=/usr/mpi/gcc/openmpi-4.0.2a1'
'--bindir=/usr/mpi/gcc/openmpi-4.0.2a1/bin'
'--sbindir=/usr/mpi/gcc/openmpi-4.0.2a1/sbin'
'--sysconfdir=/usr/mpi/gcc/openmpi-4.0.2a1/etc'
'--datadir=/usr/mpi/gcc/openmpi-4.0.2a1/share'
'--includedir=/usr/mpi/gcc/openmpi-4.0.2a1/include'
'--libdir=/usr/mpi/gcc/openmpi-4.0.2a1/lib64'
'--libexecdir=/usr/mpi/gcc/openmpi-4.0.2a1/libexec'
'--localstatedir=/var' '--sharedstatedir=/var/lib'
'--mandir=/usr/mpi/gcc/openmpi-4.0.2a1/share/man'
'--infodir=/usr/mpi/gcc/openmpi-4.0.2a1/share/info'
'--with-platform=contrib/platform/mellanox/optimized'
However, the following MCA components, which do reference PMI and Slurm, are also present:
MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v4.0.2)
MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v4.0.2)
MCA pmix: pmix3x (MCA v2.1.0, API v2.0.0, Component v4.0.2)
MCA ess: pmi (MCA v2.1.0, API v3.0.0, Component v4.0.2)
MCA ess: singleton (MCA v2.1.0, API v3.0.0, Component v4.0.2)
MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.2)
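In case it is useful, this is roughly how I am checking things on the login node; the grep patterns are just my guess at what to look for:

```
# Show the configure line and any PMI/Slurm-related entries in ompi_info
ompi_info | grep -i "Configure command line"
ompi_info | grep -i -E "pmi|slurm"

# List the PMI plugin types that srun on this cluster supports
srun --mpi=list
```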
On Fri, May 19, 2023 at 2:48 PM Juergen Salk <juergen.salk at uni-ulm.de> wrote:
> Hi,
>
> I am not sure if this is related to GPUs. I rather think the issue has to
> do with how your OpenMPI has been built.
>
> What does the ompi_info command show? Look for "Configure command line" in
> the output. Does it include the '--with-slurm' and '--with-pmi' flags?
>
> To the best of my knowledge, both flags need to be set for OpenMPI to
> work with srun.
>
> Also see:
>
> https://www.open-mpi.org/faq/?category=slurm#slurm-direct-srun-mpi-apps
>
> https://slurm.schedmd.com/mpi_guide.html#open_mpi
>
> Best regards
> Jürgen
>
>
> * Saksham Pande 5-Year IDD Physics <saksham.pande.phy20 at itbhu.ac.in> [230519 07:42]:
> > Hi everyone,
> > I am trying to run simulation software on Slurm using openmpi-4.1.1 and
> > cuda/11.1.
> > On executing, I get the following error:
> >
> > srun --mpi=pmi2 --nodes=1 --ntasks-per-node=5 --partition=gpu --gres=gpu:1 --time=02:00:00 --pty bash -i
> > ./<executable>
> >
> >
> >
> > ```
> > ._____________________________________________________________________________________
> > |
> > | Initial checks...
> > | All good.
> > |_____________________________________________________________________________________
> > [gpu008:162305] OPAL ERROR: Not initialized in file pmix3x_client.c at line 112
> > --------------------------------------------------------------------------
> > The application appears to have been direct launched using "srun",
> > but OMPI was not built with SLURM's PMI support and therefore cannot
> > execute. There are several options for building PMI support under
> > SLURM, depending upon the SLURM version you are using:
> >
> > version 16.05 or later: you can use SLURM's PMIx support. This
> > requires that you configure and build SLURM --with-pmix.
> >
> > Versions earlier than 16.05: you must use either SLURM's PMI-1 or
> > PMI-2 support. SLURM builds PMI-1 by default, or you can manually
> > install PMI-2. You must then build Open MPI using --with-pmi pointing
> > to the SLURM PMI library location.
> >
> > Please configure as appropriate and try again.
> > --------------------------------------------------------------------------
> > *** An error occurred in MPI_Init
> > *** on a NULL communicator
> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > *** and potentially your MPI job)
> > [gpu008:162305] Local abort before MPI_INIT completed completed
> > successfully, but am not able to aggregate error messages, and not able to
> > guarantee that all other processes were killed!
> > ```
> >
> >
> > I am using the following modules: gcc/10.2 openmpi/4.1.1 cuda/11.1.
> > Running 'which mpic++', 'which mpirun', or 'which nvcc' returns only the
> > module paths, which looks correct. I also adjusted $PATH and
> > $LD_LIBRARY_PATH based on ldd <executable>, but I still get the same error.
> >
> > [sakshamp.phy20.itbhu at login2 menura]$ srun --mpi=list
> > srun: MPI types are...
> > srun: cray_shasta
> > srun: none
> > srun: pmi2
> >
> > What should I do from here? I have been stuck on this error for 6 days
> > now. If there is a build difference, I will have to tell the sysadmin.
> > Since there is an OpenMPI pairing error with Slurm, are there other
> > errors I could expect between CUDA and OpenMPI?
> >
> > Thanks
>
>
>
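
For when I raise this with the sysadmin: based on the error text and the links above, my (possibly wrong) understanding is that rebuilding OpenMPI with Slurm's PMI support would need a configure line roughly like the sketch below. Every path here is a placeholder, not the actual location on our cluster, and would have to be confirmed by the sysadmin:

```
# Rough sketch only -- the install prefix, the Slurm PMI location, and the
# CUDA path are placeholders and need to be adjusted for the actual system.
./configure --prefix=/opt/openmpi-4.1.1 \
            --with-slurm \
            --with-pmi=/usr \
            --with-cuda=/usr/local/cuda-11.1
make -j 8 all    # parallel build; the job count is arbitrary
make install
```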