[slurm-users] NAS benchmarks - problem with openmpi, slurm and pmi

Artem Polyakov artpol84 at gmail.com
Thu Dec 7 10:51:20 MST 2017


Also, please post the output of
$ srun --mpi=list
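For reference, pmi2 has to show up in that list; on a build where the PMI2 plugin
is present the output looks roughly like this (the exact set of types depends on
how Slurm was compiled):

srun: MPI types are...
srun: openmpi
srun: none
srun: pmi2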

When the job crashes, are there any error messages in the relevant slurmd.log files
or in the output on the screen?
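The slurmd logs live on the compute nodes at the path set by SlurmdLogFile in
slurm.conf (commonly something like /var/log/slurmd.log, though that is
site-specific), so for example:

$ grep -iE 'error|pmi' /var/log/slurmd.log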

2017-12-07 9:49 GMT-08:00 Artem Polyakov <artpol84 at gmail.com>:

> Hello,
>
> what is the value of the MpiDefault option in your Slurm configuration file?
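>
> For reference, the current value can be checked with:
>
> $ scontrol show config | grep -i MpiDefault
>
> If it is set to none, srun will not provide the PMI2 environment unless
> --mpi=pmi2 is passed explicitly, which would be consistent with the warning below.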
>
> 2017-12-07 9:37 GMT-08:00 Glenn (Gedaliah) Wolosh <gwolosh at njit.edu>:
>
>> Hello
>>
>> This is using Slurm version - 17.02.6 running on Scientific Linux release
>> 7.4 (Nitrogen)
>>
>> [gwolosh at p-slogin bin]$ module li
>>
>> Currently Loaded Modules:
>>   1) GCCcore/.5.4.0 (H)   2) binutils/.2.26 (H)   3) GCC/5.4.0-2.26   4)
>> numactl/2.0.11   5) hwloc/1.11.3   6) OpenMPI/1.10.3
>>
>> If I run
>>
>> srun --nodes=8 --ntasks-per-node=8 --ntasks=64  ./ep.C.64
>>
>> It runs successfully, but I get a message —
>>
>> PMI2 initialized but returned bad values for size/rank/jobid.
>> This is symptomatic of either a failure to use the
>> "--mpi=pmi2" flag in SLURM, or a borked PMI2 installation.
>> If running under SLURM, try adding "-mpi=pmi2" to your
>> srun command line. If that doesn't work, or if you are
>> not running under SLURM, try removing or renaming the
>> pmi2.h header file so PMI2 support will not automatically
>> be built, reconfigure and build OMPI, and then try again
>> with only PMI1 support enabled.
>>
>> If I run
>>
>> srun --nodes=8 --ntasks-per-node=8 --ntasks=64 --mpi=pmi2 ./ep.C.64
>>
>> The job crashes
>>
>> If I run via sbatch —
>>
>> #!/bin/bash
>> # Job name:
>> #SBATCH --job-name=nas_bench
>> #SBATCH --nodes=8
>> #SBATCH --ntasks=64
>> #SBATCH --ntasks-per-node=8
>> #SBATCH --time=48:00:00
>> #SBATCH --output=nas.out.1
>> #
>> ## Command(s) to run (example):
>> module use $HOME/easybuild/modules/all/Core
>> module load GCC/5.4.0-2.26 OpenMPI/1.10.3
>> mpirun -np 64  ./ep.C.64
>>
>> the job crashes
>>
>> Using easybuild, these are my config options for ompi —
>>
>> configopts = '--with-threads=posix --enable-shared --enable-mpi-thread-multiple --with-verbs '
>> configopts += '--enable-mpirun-prefix-by-default '  # suppress failure modes in relation to mpirun path
>> configopts += '--with-hwloc=$EBROOTHWLOC '  # hwloc support
>> configopts += '--disable-dlopen '  # statically link component, don't do dynamic loading
>> configopts += '--with-slurm --with-pmi '
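>>
>> For completeness, the PMI prefix can also be given to configure explicitly when
>> it is not on a default search path, e.g. (assuming the system-wide install under
>> /usr that the ldd output below points at):
>>
>> configopts += '--with-slurm --with-pmi=/usr '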
>>
>> And finally —
>>
>> $ ldd /opt/local/easybuild/software/Compiler/GCC/5.4.0-2.26/OpenMPI/1.10.3/bin/orterun | grep pmi
>>         libpmi.so.0 => /usr/lib64/libpmi.so.0 (0x00007f0129d6d000)
>>         libpmi2.so.0 => /usr/lib64/libpmi2.so.0 (0x00007f0129b51000)
>>
>> $ ompi_info | grep pmi
>>                   MCA db: pmi (MCA v2.0.0, API v1.0.0, Component v1.10.3)
>>                  MCA ess: pmi (MCA v2.0.0, API v3.0.0, Component v1.10.3)
>>              MCA grpcomm: pmi (MCA v2.0.0, API v2.0.0, Component v1.10.3)
>>               MCA pubsub: pmi (MCA v2.0.0, API v2.0.0, Component v1.10.3)
>>
>>
>> Any suggestions?
>> _______________
>> Gedaliah Wolosh
>> IST Academic and Research Computing Systems (ARCS)
>> NJIT
>> GITC 2203
>> 973 596 5437
>> gwolosh at njit.edu
>>
>>
>
>
> --
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov
>



-- 
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov