[slurm-users] problems with slurm and openmpi
Riccardo Veraldi
riccardo.veraldi at gmail.com
Fri Mar 15 03:24:21 UTC 2019
Thanks to all.
The problem is that Slurm's configure is not able to find the pmix include files:
configure:20846: checking for pmix installation
configure:21005: result:
configure:21021: WARNING: unable to locate pmix installation
regardless of the path I give.
The reason is that configure tests for the following header files:
test -f "$d/include/pmix/pmix_common.h"
test -f "$d/include/pmix_server.h"
but neither of the two is installed by openmpi.
One of the two is in the openmpi source code tarball:
./opal/mca/pmix/pmix3x/pmix/include/pmix_server.h
The other one exists only as a ".h.in" file, not as a ".h":
./opal/mca/pmix/pmix3x/pmix/include/pmix_common.h.in
Either way, they do not get installed by the rpm.
The last thing I can try is to build openmpi directly from source and give
up on the rpm package build. The openmpi .spec also has errors, which I
had to fix manually before it would build successfully.
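
The two header tests quoted from configure can be reproduced by hand against any candidate prefix before rerunning the build. The following is a minimal sketch, not taken from Slurm's configure script: the function name check_pmix_headers and the list of prefixes are illustrative assumptions, and it reports each header separately rather than reproducing how configure combines the two tests.

```shell
#!/bin/sh
# Report which of the two headers named in the configure tests
# exist under a given prefix. Returns non-zero if any is missing.
# check_pmix_headers is a hypothetical helper, not part of Slurm.
check_pmix_headers() {
    d="$1"
    missing=0
    for h in include/pmix/pmix_common.h include/pmix_server.h; do
        if test -f "$d/$h"; then
            echo "found:   $d/$h"
        else
            echo "missing: $d/$h"
            missing=1
        fi
    done
    return "$missing"
}

# Example prefixes only: /usr is the one suggested in the thread,
# /opt/openmpi/4.0.0 is the alternate install path mentioned there.
for prefix in /usr /opt/openmpi/4.0.0; do
    check_pmix_headers "$prefix" || echo "at least one header missing under $prefix"
done
```

A prefix that reports both headers as found is a reasonable candidate to pass as --with-pmix=<prefix>; if none does, the headers most likely have to come from a separate PMIx installation or a source build.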
On 3/12/19 4:56 PM, Daniel Letai wrote:
> Hi.
> On 12/03/2019 22:53:36, Riccardo Veraldi wrote:
>> Hello,
>> after trying hard for over 10 days I am forced to write to the list.
>> I am not able to have SLURM work with openmpi. Openmpi compiled
>> binaries won't run on slurm, while all non openmpi progs run just
>> fine under "srun". I am using SLURM 18.08.5 building the rpm from the
>> tarball: rpmbuild -ta slurm-18.08.5-2.tar.bz2
>> Prior to building SLURM I installed openmpi 4.0.0, which has built-in
>> pmix support. The pmix libraries are in /usr/lib64/pmix/, which is the
>> default installation path.
>>
>> The problem is that hellompi does not work if I launch it from srun.
>> Of course it runs outside slurm.
>>
>> [psanagpu105:10995] OPAL ERROR: Not initialized in file
>> pmix3x_client.c at line 113
>> --------------------------------------------------------------------------
>> The application appears to have been direct launched using "srun",
>> but OMPI was not built with SLURM's PMI support and therefore cannot
>> execute. There are several options for building PMI support under
>
> I would guess (though having the config.log files would verify it) that
> you should rebuild Slurm --with-pmix and then rebuild
> OpenMPI --with-slurm.
>
> Currently there might be a bug in Slurm's configure file when building
> PMIx support without an explicit path, so you might either modify the
> spec before building (add --with-pmix=/usr to the configure section) or,
> for testing purposes, run ./configure --with-pmix=/usr; make; make install.
>
>
> It seems your current configuration has a built-in mismatch: Slurm only
> supports pmi2, while OpenMPI only supports PMIx. You should build with
> at least one common PMI: either external PMIx when building Slurm, or
> Slurm's PMI2 when building OpenMPI.
>
> However, I would have expected the non-PMI option (srun --mpi=openmpi)
> to work even in your env, and Slurm should have built PMIx support
> automatically since it's in default search path.
>
>
>> SLURM, depending upon the SLURM version you are using:
>>
>> version 16.05 or later: you can use SLURM's PMIx support. This
>> requires that you configure and build SLURM --with-pmix.
>>
>> Versions earlier than 16.05: you must use either SLURM's PMI-1 or
>> PMI-2 support. SLURM builds PMI-1 by default, or you can manually
>> install PMI-2. You must then build Open MPI using --with-pmi pointing
>> to the SLURM PMI library location.
>>
>> Please configure as appropriate and try again.
>> --------------------------------------------------------------------------
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> *** and potentially your MPI job)
>> [psanagpu105:10995] Local abort before MPI_INIT completed completed
>> successfully, but am not able to aggregate error messages, and not
>> able to guarantee that all other processes were killed!
>> srun: error: psanagpu105: task 0: Exited with exit code 1
>>
>> I really have no clue. I even reinstalled openmpi in a separate
>> path, /opt/openmpi/4.0.0.
>> Anyway, it seems like slurm does not know how to find the MPI libraries
>> even though they are there, and right now in the default path /usr/lib64.
>>
>> even using --mpi=pmi2 or --mpi=openmpi does not fix the problem and
>> the same error message is given to me.
>> srun --mpi=list
>> srun: MPI types are...
>> srun: none
>> srun: openmpi
>> srun: pmi2
>>
>>
>> Any hint on how I could fix this problem?
>> thanks a lot
>>
>> Rick
>>
>>
> --
> Regards,
>
> Dani_L.