[slurm-users] HELP: error between compilation and execution on gpu cluster

Juergen Salk juergen.salk at uni-ulm.de
Fri May 19 09:14:07 UTC 2023


Hi,

I am not sure if this is related to GPUs. I rather think the issue has to do with
how your OpenMPI has been built.

What does the ompi_info command show? Look for "Configure command line" in
the output. Does it include the '--with-slurm' and '--with-pmi' flags?
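
For example (the exact output will depend on how your OpenMPI was built;
the grep patterns below are just one convenient way to narrow it down):

  $ ompi_info | grep -i "configure command"
  $ ompi_info | grep -i pmi

If OpenMPI has been built against Slurm's PMI libraries, the second
command would typically also show the s1/s2 pmix components.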

To the best of my knowledge, both flags need to be set for OpenMPI to
work with srun.
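
As an illustration only, a Slurm-aware OpenMPI build would be configured
along these lines (the prefix and library paths below are placeholders,
your site's paths will differ):

  $ ./configure --prefix=/opt/openmpi/4.1.1 \
                --with-slurm \
                --with-pmi=/usr \
                --with-cuda=/usr/local/cuda-11.1
  $ make -j && make install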

Also see:

https://www.open-mpi.org/faq/?category=slurm#slurm-direct-srun-mpi-apps

https://slurm.schedmd.com/mpi_guide.html#open_mpi
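
Once OpenMPI has been rebuilt that way, a direct launch with srun should
then work, for example (sketched with the options from your original
command; replace ./your_executable with your actual binary):

  $ srun --mpi=pmi2 --nodes=1 --ntasks-per-node=5 --partition=gpu \
         --gres=gpu:1 --time=02:00:00 ./your_executable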

Best regards
Jürgen


* Saksham Pande 5-Year IDD Physics <saksham.pande.phy20 at itbhu.ac.in> [230519 07:42]:
> Hi everyone,
> I am trying to run simulation software on Slurm using openmpi-4.1.1 and
> cuda/11.1.
> On executing, I get the following error:
> 
> srun --mpi=pmi2 --nodes=1 --ntasks-per-node=5 --partition=gpu --gres=gpu:1
> --time=02:00:00 --pty bash -i
> ./<executable>
> 
> 
> ```
> ._____________________________________________________________________________________
> |
> | Initial checks...
> | All good.
> |_____________________________________________________________________________________
> [gpu008:162305] OPAL ERROR: Not initialized in file pmix3x_client.c at line
> 112
> --------------------------------------------------------------------------
> The application appears to have been direct launched using "srun",
> but OMPI was not built with SLURM's PMI support and therefore cannot
> execute. There are several options for building PMI support under
> SLURM, depending upon the SLURM version you are using:
> 
>   version 16.05 or later: you can use SLURM's PMIx support. This
>   requires that you configure and build SLURM --with-pmix.
> 
>   Versions earlier than 16.05: you must use either SLURM's PMI-1 or
>   PMI-2 support. SLURM builds PMI-1 by default, or you can manually
>   install PMI-2. You must then build Open MPI using --with-pmi pointing
>   to the SLURM PMI library location.
> 
> Please configure as appropriate and try again.
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [gpu008:162305] Local abort before MPI_INIT completed completed
> successfully, but am not able to aggregate error messages, and not able to
> guarantee that all other processes were killed!
> ```
> 
> 
> I am using the following modules: gcc/10.2 openmpi/4.1.1 cuda/11.1
> Running "which mpic++", "which mpirun" or "which nvcc" returns the module
> paths only, which looks correct.
> I also changed $PATH and $LD_LIBRARY_PATH based on ldd <executable>,
> but I still get the same error.
> 
> [sakshamp.phy20.itbhu at login2 menura]$ srun --mpi=list
> srun: MPI types are...
> srun: cray_shasta
> srun: none
> srun: pmi2
> 
> What should I do from here? I have been stuck on this error for 6 days now.
> If there is any build difference, I will have to tell the sysadmin.
> Since there is an OpenMPI pairing error with Slurm, are there other errors I
> could expect between CUDA and OpenMPI?
> 
> Thanks



