<div dir="ltr"><div>Thank you for responding.</div><div>The output of ompi_info regarding configuration is <br></div><div><br></div><div>Configure command line: '--build=x86_64-redhat-linux-gnu'<br> '--host=x86_64-redhat-linux-gnu'<br> '--program-prefix=' '--disable-dependency-tracking'<br> '--prefix=/usr/mpi/gcc/openmpi-4.0.2a1'<br> '--exec-prefix=/usr/mpi/gcc/openmpi-4.0.2a1'<br> '--bindir=/usr/mpi/gcc/openmpi-4.0.2a1/bin'<br> '--sbindir=/usr/mpi/gcc/openmpi-4.0.2a1/sbin'<br> '--sysconfdir=/usr/mpi/gcc/openmpi-4.0.2a1/etc'<br> '--datadir=/usr/mpi/gcc/openmpi-4.0.2a1/share'<br> '--includedir=/usr/mpi/gcc/openmpi-4.0.2a1/include'<br> '--libdir=/usr/mpi/gcc/openmpi-4.0.2a1/lib64'<br> '--libexecdir=/usr/mpi/gcc/openmpi-4.0.2a1/libexec'<br> '--localstatedir=/var' '--sharedstatedir=/var/lib'<br> '--mandir=/usr/mpi/gcc/openmpi-4.0.2a1/share/man'<br> '--infodir=/usr/mpi/gcc/openmpi-4.0.2a1/share/info'<br> '--with-platform=contrib/platform/mellanox/optimized'</div><div><br></div><div>BUT <br> MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v4.0.2)<br> MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v4.0.2)<br> MCA pmix: pmix3x (MCA v2.1.0, API v2.0.0, Component v4.0.2)</div><div> MCA ess: pmi (MCA v2.1.0, API v3.0.0, Component v4.0.2)<br> MCA ess: singleton (MCA v2.1.0, API v3.0.0, Component v4.0.2)<br> MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.2)</div><div>are also present and contain references to pmi and slurm.</div><div><br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, May 19, 2023 at 2:48 PM Juergen Salk <<a href="mailto:juergen.salk@uni-ulm.de">juergen.salk@uni-ulm.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi,<br>
On Fri, May 19, 2023 at 2:48 PM Juergen Salk <juergen.salk@uni-ulm.de> wrote:

Hi,

I am not sure if this is related to GPUs. I rather think the issue has to do with
how your OpenMPI has been built.

What does the ompi_info command show? Look for "Configure command line" in
the output. Does it include the '--with-slurm' and '--with-pmi' flags?

To the best of my knowledge, both flags need to be set for OpenMPI to
work with srun.
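For illustration only, a rebuild might then look roughly like this; the installation prefix is just a placeholder, and --with-pmi has to point at whatever prefix actually holds your Slurm PMI headers and libraries (often /usr on RPM-based installs):

    # Sketch, not a verified recipe: configure Open MPI with Slurm support
    # and with Slurm's PMI library so that direct launch via srun can work.
    ./configure --prefix=/opt/openmpi-4.1.1 \
                --with-slurm \
                --with-pmi=/usr
    make && make install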
Also see:

https://www.open-mpi.org/faq/?category=slurm#slurm-direct-srun-mpi-apps

https://slurm.schedmd.com/mpi_guide.html#open_mpi

Best regards
Jürgen

* Saksham Pande 5-Year IDD Physics <saksham.pande.phy20@itbhu.ac.in> [230519 07:42]:
> Hi everyone,
> I am trying to run simulation software on Slurm using openmpi-4.1.1 and
> cuda/11.1.
> On executing, I get the following error:
>
> srun --mpi=pmi2 --nodes=1 --ntasks-per-node=5 --partition=gpu --gres=gpu:1 --time=02:00:00 --pty bash -i
> ./<executable>
>
>
> ```
> ._____________________________________________________________________________________
> |
> | Initial checks...
> | All good.
> |_____________________________________________________________________________________
> [gpu008:162305] OPAL ERROR: Not initialized in file pmix3x_client.c at line 112
> --------------------------------------------------------------------------
> The application appears to have been direct launched using "srun",
> but OMPI was not built with SLURM's PMI support and therefore cannot
> execute. There are several options for building PMI support under
> SLURM, depending upon the SLURM version you are using:
>
> version 16.05 or later: you can use SLURM's PMIx support. This
> requires that you configure and build SLURM --with-pmix.
>
> Versions earlier than 16.05: you must use either SLURM's PMI-1 or
> PMI-2 support. SLURM builds PMI-1 by default, or you can manually
> install PMI-2. You must then build Open MPI using --with-pmi pointing
> to the SLURM PMI library location.
>
> Please configure as appropriate and try again.
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [gpu008:162305] Local abort before MPI_INIT completed completed
> successfully, but am not able to aggregate error messages, and not able to
> guarantee that all other processes were killed!
> ```
>
>
> I am using the following modules: gcc/10.2 openmpi/4.1.1 cuda/11.1.
> Running 'which mpic++', 'which mpirun' or 'which nvcc' returns only the module
> paths, which look correct.
> I also adjusted $PATH and $LD_LIBRARY_PATH based on ldd <executable>,
> but I still get the same error.
>
> [sakshamp.phy20.itbhu@login2 menura]$ srun --mpi=list
> srun: MPI types are...
> srun: cray_shasta
> srun: none
> srun: pmi2
>
> What should I do from here? I have been stuck on this error for 6 days now. If
> there is any build difference, I will have to tell the sysadmin.
> Since there is a pairing error between openmpi and slurm, are there other errors I
> could expect between cuda and openmpi?
>
> Thanks