[slurm-users] problems with slurm and openmpi
Riccardo Veraldi
riccardo.veraldi at gmail.com
Fri Mar 15 05:12:13 UTC 2019
I missed this step, then, of building pmix separately. I thought that the built-in pmix inside openmpi could be used by slurm.
>
> On Mar 14, 2019 at 9:32 PM, Gilles Gouaillardet <gilles at rist.or.jp> wrote:
>
>
>
> Riccardo,
>
>
> I am a bit confused by your explanation.
>
>
> Open MPI does embed PMIx, but only for itself.
>
> Another way to put it is that you have to install pmix first (as a
> package, or downloaded from pmix.org) and then build SLURM on top of it.
>
>
> Then you can build Open MPI with the same (external) PMIx or the
> embedded one (since PMIx offers cross-version compatibility support).
>
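> For what it's worth, a minimal sketch of that build order (the PMIx
> version and the /usr prefix are assumptions, adjust to your site):
>
> # 1. install PMIx first (release tarballs are linked from pmix.org)
> tar xjf pmix-3.1.2.tar.bz2 && cd pmix-3.1.2
> ./configure --prefix=/usr && make && make install
>
> # 2. build SLURM against that external PMIx
> cd ../slurm-18.08.5-2
> ./configure --with-pmix=/usr && make && make install
>
> # 3. build Open MPI against the same PMIx (depending on the Open MPI
> #    version you may also need --with-libevent=external)
> cd ../openmpi-4.0.0
> ./configure --with-slurm --with-pmix=/usr && make && make install
>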
>
> Cheers,
>
>
> Gilles
>
>
> On 3/15/2019 12:24 PM, Riccardo Veraldi wrote:
> > Thanks to all.
> > The problem is that slurm's configure is not able to find the pmix
> > includes:
> >
> > configure:20846: checking for pmix installation
> > configure:21005: result:
> > configure:21021: WARNING: unable to locate pmix installation
> >
> > regardless of the path I give, and the reason is that configure
> > searches for the following includes:
> >
> > test -f "$d/include/pmix/pmix_common.h"
> > test -f "$d/include/pmix_server.h"
> >
> > but neither of the two is installed by openmpi.
> >
> > One of the two is in the openmpi source code tarball:
> >
> > ./opal/mca/pmix/pmix3x/pmix/include/pmix_server.h
> >
> > the other one only exists as a ".h.in" template, not a ".h" file:
> >
> > ./opal/mca/pmix/pmix3x/pmix/include/pmix_common.h.in
> >
> > Either way, they do not get installed by the rpm.
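> >
> > As a quick check (the /usr prefix and the pmix-devel package name are
> > assumptions, they vary by distro), those headers should come from a
> > PMIx development install rather than from openmpi:
> >
> > ls /usr/include/pmix_server.h /usr/include/pmix/pmix_common.h
> > rpm -ql pmix-devel | grep include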
> >
> > The last thing I can try is to build openmpi directly from source and
> > give up on the rpm package build. The openmpi .spec also has errors,
> > which I had to fix manually to get it to build successfully.
> >
> >
> >
> > On 3/12/19 4:56 PM, Daniel Letai wrote:
> >> Hi.
> >> On 12/03/2019 22:53:36, Riccardo Veraldi wrote:
> >>> Hello,
> >>> after trying hard for over 10 days, I am forced to write to the list.
> >>> I am not able to get SLURM to work with openmpi. Openmpi-compiled
> >>> binaries won't run under slurm, while all non-openmpi programs run
> >>> just fine under "srun". I am using SLURM 18.08.5, building the rpm
> >>> from the tarball: rpmbuild -ta slurm-18.08.5-2.tar.bz2
> >>> Prior to building SLURM I installed openmpi 4.0.0, which has
> >>> built-in pmix support. The pmix libraries are in /usr/lib64/pmix/,
> >>> which is the default installation path.
> >>>
> >>> The problem is that hellompi does not work if I launch it from
> >>> srun; of course it runs fine outside slurm.
> >>>
> >>> [psanagpu105:10995] OPAL ERROR: Not initialized in file
> >>> pmix3x_client.c at line 113
> >>> --------------------------------------------------------------------------
> >>> The application appears to have been direct launched using "srun",
> >>> but OMPI was not built with SLURM's PMI support and therefore cannot
> >>> execute. There are several options for building PMI support under
> >>
> >> I would guess (but having the config.log files would verify it) that
> >> you should rebuild Slurm --with-pmix and then rebuild OpenMPI
> >> --with-slurm.
> >>
> >> Currently there might be a bug in Slurm's configure script when
> >> building PMIx support without an explicit path, so you might either
> >> modify the spec before building (add --with-pmix=/usr to the
> >> configure section) or, for testing purposes, run
> >> ./configure --with-pmix=/usr; make; make install.
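> >>
> >> A sketch of both routes (the /usr prefix is an assumption; point it
> >> at wherever PMIx is actually installed):
> >>
> >> # route 1: rebuild the rpm after adding --with-pmix=/usr to the
> >> # configure invocation in slurm.spec
> >> rpmbuild -ta slurm-18.08.5-2.tar.bz2
> >>
> >> # route 2: quick test build straight from the source tree
> >> ./configure --with-pmix=/usr
> >> make && make install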
> >>
> >>
> >> It seems your current configuration has a built-in mismatch: Slurm
> >> only supports pmi2, while OpenMPI only supports PMIx. You should
> >> build with at least one common PMI: either an external PMIx when
> >> building Slurm, or Slurm's PMI2 when building OpenMPI.
> >>
> >> However, I would have expected the non-PMI option (srun
> >> --mpi=openmpi) to work even in your env, and Slurm should have built
> >> PMIx support automatically since it's in the default search path.
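> >>
> >> Once both sides share a PMIx installation, the check would be that
> >> srun lists a pmix plugin and the job launches with it (hellompi is
> >> your test binary from above):
> >>
> >> srun --mpi=list          # "pmix" should now appear in the output
> >> srun --mpi=pmix -n 2 ./hellompi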
> >>
> >>
> >>> SLURM, depending upon the SLURM version you are using:
> >>>
> >>> version 16.05 or later: you can use SLURM's PMIx support. This
> >>> requires that you configure and build SLURM --with-pmix.
> >>>
> >>> Versions earlier than 16.05: you must use either SLURM's PMI-1 or
> >>> PMI-2 support. SLURM builds PMI-1 by default, or you can manually
> >>> install PMI-2. You must then build Open MPI using --with-pmi pointing
> >>> to the SLURM PMI library location.
> >>>
> >>> Please configure as appropriate and try again.
> >>> --------------------------------------------------------------------------
> >>> *** An error occurred in MPI_Init
> >>> *** on a NULL communicator
> >>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> >>> *** and potentially your MPI job)
> >>> [psanagpu105:10995] Local abort before MPI_INIT completed completed
> >>> successfully, but am not able to aggregate error messages, and not
> >>> able to guarantee that all other processes were killed!
> >>> srun: error: psanagpu105: task 0: Exited with exit code 1
> >>>
> >>> I really have no clue. I even reinstalled openmpi in a different
> >>> path, /opt/openmpi/4.0.0.
> >>> Anyway, it seems like slurm does not know how to find the MPI
> >>> libraries even though they are there, currently in the default path
> >>> /usr/lib64.
> >>>
> >>> Even using --mpi=pmi2 or --mpi=openmpi does not fix the problem;
> >>> the same error message is given.
> >>> srun --mpi=list
> >>> srun: MPI types are...
> >>> srun: none
> >>> srun: openmpi
> >>> srun: pmi2
> >>>
> >>>
> >>> Any hint on how I could fix this problem?
> >>> thanks a lot
> >>>
> >>> Rick
> >>>
> >>>
> >> --
> >> Regards,
> >>
> >> Dani_L.
> >
> >
>
>