[slurm-users] NAS benchmarks - problem with openmpi, slurm and pmi

Glenn (Gedaliah) Wolosh gwolosh at njit.edu
Thu Dec 7 10:37:01 MST 2017


Hello

This is Slurm version 17.02.6 running on Scientific Linux release 7.4 (Nitrogen).

[gwolosh@p-slogin bin]$ module li

Currently Loaded Modules:
  1) GCCcore/.5.4.0 (H)   2) binutils/.2.26 (H)   3) GCC/5.4.0-2.26   4) numactl/2.0.11   5) hwloc/1.11.3   6) OpenMPI/1.10.3

If I run

srun --nodes=8 --ntasks-per-node=8 --ntasks=64  ./ep.C.64

It runs successfully, but I get this message:

PMI2 initialized but returned bad values for size/rank/jobid.
This is symptomatic of either a failure to use the
"--mpi=pmi2" flag in SLURM, or a borked PMI2 installation.
If running under SLURM, try adding "-mpi=pmi2" to your
srun command line. If that doesn't work, or if you are
not running under SLURM, try removing or renaming the
pmi2.h header file so PMI2 support will not automatically
be built, reconfigure and build OMPI, and then try again
with only PMI1 support enabled.
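
Side note: rather than passing --mpi=pmi2 on every srun, my understanding is that pmi2 can be made the default in slurm.conf. This is just a sketch, not something I have tried on this cluster yet:

# hypothetical slurm.conf change: make pmi2 the default MPI plugin for srun
MpiDefault=pmi2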

If I run

srun --nodes=8 --ntasks-per-node=8 --ntasks=64 --mpi=pmi2 ./ep.C.64

The job crashes

If I run via sbatch —

#!/bin/bash
# Job name:
#SBATCH --job-name=nas_bench
#SBATCH --nodes=8
#SBATCH --ntasks=64
#SBATCH --ntasks-per-node=8
#SBATCH --time=48:00:00
#SBATCH --output=nas.out.1
#
## Command(s) to run (example):
module use $HOME/easybuild/modules/all/Core
module load GCC/5.4.0-2.26 OpenMPI/1.10.3
mpirun -np 64  ./ep.C.64

The job crashes
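
One variant I have not tried yet is replacing mpirun with srun inside the batch script, so the tasks are launched through Slurm's PMI2 directly. A sketch, keeping the same resources:

#!/bin/bash
#SBATCH --job-name=nas_bench
#SBATCH --nodes=8
#SBATCH --ntasks=64
#SBATCH --ntasks-per-node=8
#SBATCH --time=48:00:00
#SBATCH --output=nas.out.1
module use $HOME/easybuild/modules/all/Core
module load GCC/5.4.0-2.26 OpenMPI/1.10.3
# launch via srun so Slurm's PMI2 supplies rank/size/jobid
srun --mpi=pmi2 ./ep.C.64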

Using EasyBuild, these are my configure options for OpenMPI:

configopts = '--with-threads=posix --enable-shared --enable-mpi-thread-multiple --with-verbs '
configopts += '--enable-mpirun-prefix-by-default '  # suppress failure modes in relation to mpirun path
configopts += '--with-hwloc=$EBROOTHWLOC '  # hwloc support
configopts += '--disable-dlopen '  # statically link component, don't do dynamic loading
configopts += '--with-slurm --with-pmi '
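
If it matters, I believe --with-pmi can also be pointed at the prefix where Slurm's PMI headers and libraries are installed; on this system that would presumably be /usr (libpmi/libpmi2 are under /usr/lib64). Untried sketch:

configopts += '--with-slurm --with-pmi=/usr '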

And finally —

$ ldd /opt/local/easybuild/software/Compiler/GCC/5.4.0-2.26/OpenMPI/1.10.3/bin/orterun | grep pmi
        libpmi.so.0 => /usr/lib64/libpmi.so.0 (0x00007f0129d6d000)
        libpmi2.so.0 => /usr/lib64/libpmi2.so.0 (0x00007f0129b51000)

$ ompi_info | grep pmi
                  MCA db: pmi (MCA v2.0.0, API v1.0.0, Component v1.10.3)
                 MCA ess: pmi (MCA v2.0.0, API v3.0.0, Component v1.10.3)
             MCA grpcomm: pmi (MCA v2.0.0, API v2.0.0, Component v1.10.3)
              MCA pubsub: pmi (MCA v2.0.0, API v2.0.0, Component v1.10.3)
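
If it helps with diagnosis, I can also check which MPI plugins this Slurm build actually provides (output not pasted here):

srun --mpi=list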


Any suggestions?
_______________
Gedaliah Wolosh
IST Academic and Research Computing Systems (ARCS)
NJIT
GITC 2203
973 596 5437
gwolosh at njit.edu
