<div dir="ltr"><div dir="ltr"><div dir="ltr"><div>Hi everyone, <br></div><div>I am trying to run a simulation software on slurm using openmpi-4.1.1 and cuda/11.1.</div><div>On executing, I get the following error:</div><div><br></div><div>srun --mpi=pmi2 --nodes=1 --ntasks-per-node=5 --partition=gpu --gres=gpu:1 --time=02:00:00 --pty bash -i</div><div>./<execultable></div><div><br></div><div><br></div><div>```._____________________________________________________________________________________<br>|<br>| Initial checks...<br>| All good.<br>|_____________________________________________________________________________________<br>[gpu008:162305] OPAL ERROR: Not initialized in file pmix3x_client.c at line 112<br>--------------------------------------------------------------------------<br>The application appears to have been direct launched using "srun",<br>but OMPI was not built with SLURM's PMI support and therefore cannot<br>execute. There are several options for building PMI support under<br>SLURM, depending upon the SLURM version you are using:<br><br> version 16.05 or later: you can use SLURM's PMIx support. This<br> requires that you configure and build SLURM --with-pmix.<br><br> Versions earlier than 16.05: you must use either SLURM's PMI-1 or<br> PMI-2 support. SLURM builds PMI-1 by default, or you can manually<br> install PMI-2. You must then build Open MPI using --with-pmi pointing<br> to the SLURM PMI library location.<br><br>Please configure as appropriate and try again.<br>--------------------------------------------------------------------------<br>*** An error occurred in MPI_Init<br>*** on a NULL communicator<br>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,<br>*** and potentially your MPI job)<br>[gpu008:162305]
Local abort before MPI_INIT completed completed successfully, but am
not able to aggregate error messages, and not able to guarantee that all
other processes were killed!
```

I am using the following modules: gcc/10.2 openmpi/4.1.1 cuda/11.1.
Running `which mpic++`, `which mpirun`, or `which nvcc` returns the module paths only, which looks correct.
I also adjusted $PATH and $LD_LIBRARY_PATH based on the output of `ldd <executable>`, but I still get the same error.

[sakshamp.phy20.itbhu@login2 menura]$ srun --mpi=list
srun: MPI types are...
srun: cray_shasta
srun: none
srun: pmi2
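If it helps, before going back to the sysadmin I can inspect the OpenMPI build on my side. Below is a rough sketch of the checks I had in mind (assuming the ompi_info on my PATH comes from the openmpi/4.1.1 module; the grep patterns are just my guesses at what to look for):

```
# Show the configure line used for this OpenMPI install
ompi_info | grep -i "configure command"

# List any PMI/PMIx-related components that were built in
ompi_info | grep -i pmi

# Check whether this OpenMPI was built with CUDA support
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
```

If nothing PMI- or PMIx-related shows up there, I suppose that would confirm what the error message is saying about the build.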
What should I do from here? I have been stuck on this error for 6 days now. If it comes down to how OpenMPI was built, I will have to tell the sysadmin.
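If a rebuild is indeed needed, is something like the configure line below what I should ask for? Since srun --mpi=list only shows pmi2 here, I am guessing the --with-pmi route from the error message applies; the prefix and library paths are only placeholders, as I do not know where Slurm's PMI library or the CUDA toolkit actually live on this cluster:

```
# Hypothetical OpenMPI 4.1.1 configure line with Slurm PMI2 and CUDA support.
# The prefix, PMI, and CUDA paths are placeholders, not the cluster's real paths.
./configure --prefix=/opt/openmpi/4.1.1 \
            --with-slurm \
            --with-pmi=/usr \
            --with-cuda=/usr/local/cuda-11.1
make -j && make install
```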
Since OpenMPI is not pairing correctly with Slurm, are there other errors I should expect between CUDA and OpenMPI?

Thanks