[slurm-users] Slurm-17.11.5 + Pmix-2.1.1/Debugging
Bill Broadley
bill at cse.ucdavis.edu
Tue May 8 18:23:56 MDT 2018
Greetings all,
I have slurm-17.11.5, pmix-1.2.4, and openmpi-3.0.1 working on several clusters.
I find srun handy for things like:
bill@headnode:~/src/relay$ srun -N 2 -n 2 -t 1 ./relay 1
c7-18 c7-19
size= 1, 16384 hops, 2 nodes in 0.03 sec ( 2.00 us/hop) 1953 KB/sec
Building was straightforward. I built the dependencies myself since I was using
a newer compiler, so I built libevent, pmix, hwloc and related against the
current compiler, then built slurm + openmpi against those shared dependencies.
However, I just tried a newer cluster with ubuntu-18.04, slurm-17.11.5,
openmpi-3.1 and pmix-2.1.1.
On the slurm side things looked promising. I removed any openmpi- and
pmix-related packages to ensure the packages I built were used. I can post the
complete log, but hopefully the lines returned with grep -i pmix are the most
helpful:
$ cat config.log | grep -i pmix
$ ./configure --prefix=/share/apps/slurm-17.11.5/gcc7
--with-pmix=/share/apps/pmix-2.1.1/gcc7
configure:21530: checking for pmix installation
configure:21565: gcc -o conftest -DNUMA_VERSION1_COMPATIBILITY -g -O2 -pthread
-I/share/apps/pmix-2.1.1/gcc7/include conftest.c
-L/share/apps/pmix-2.1.1/gcc7/lib -lpmix >&5
configure:21596: gcc -E -I/share/apps/pmix-2.1.1/gcc7/include conftest.c
configure:21648: result: /share/apps/pmix-2.1.1/gcc7
| #define HAVE_PMIX 1
config.status:1697: creating src/plugins/mpi/pmix/Makefile
x_ac_cv_pmix_dir=/share/apps/pmix-2.1.1/gcc7
x_ac_cv_pmix_libdir=/share/apps/pmix-2.1.1/gcc7/lib
HAVE_PMIX_FALSE='#'
HAVE_PMIX_TRUE=''
HAVE_PMIX_V1_FALSE=''
HAVE_PMIX_V1_TRUE='#'
HAVE_PMIX_V2_FALSE='#'
HAVE_PMIX_V2_TRUE=''
PMIX_V1_CPPFLAGS=''
PMIX_V1_LDFLAGS=''
PMIX_V2_CPPFLAGS='-I/share/apps/pmix-2.1.1/gcc7/include'
PMIX_V2_LDFLAGS='-Wl,-rpath -Wl,/share/apps/pmix-2.1.1/gcc7/lib
-L/share/apps/pmix-2.1.1/gcc7/lib'
#define HAVE_PMIX 1
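Read as automake conditionals, an empty _TRUE value means the conditional is enabled and '#' means it is disabled, so the lines above say the PMIx v2 plugin is being built and the v1 plugin is not. A minimal sketch for pulling that summary out of a config.log follows. The function name pmix_enabled is made up for illustration, and the sketch assumes the conditionals appear one per line in the form shown above.

```shell
# Sketch: list the HAVE_PMIX* automake conditionals that configure enabled.
# Assumption: each conditional appears on its own line as NAME_TRUE='' when
# enabled and NAME_TRUE='#' when disabled, as in the grep output above.
pmix_enabled() {
    # Reads config.log-style text on stdin and prints the enabled names,
    # e.g. HAVE_PMIX_V2 for a line reading HAVE_PMIX_V2_TRUE=''
    sed -n "s/^\(HAVE_PMIX[A-Z0-9_]*\)_TRUE=''\$/\1/p"
}

# Usage (run from the Slurm build directory):
#   pmix_enabled < config.log
```

For the output quoted above this would print HAVE_PMIX and HAVE_PMIX_V2, which matches the v2-only build that was intended.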
Looks pretty promising so far. Some of the most relevant lines for openmpi-3.1 are:
OPAL_CONFIGURE_CLI=' \'\''--prefix=/share/apps/openmpi-3.1.0/gcc7\'\''
\'\''--with-pmix=/share/apps/pmix-2.1.1/gcc7\'\''
\'\''--with-libevent=external\'\'' \'\''--disable-io-romio\'\''
\'\''--disable-io-ompio\'\'''
opal_pmix_ext1x_CPPFLAGS='-I/share/apps/pmix-2.1.1/gcc7/include'
opal_pmix_ext1x_LDFLAGS='-L/share/apps/pmix-2.1.1/gcc7/lib'
opal_pmix_ext1x_LIBS='-lpmix'
opal_pmix_ext2x_CPPFLAGS='-I/share/apps/pmix-2.1.1/gcc7/include'
opal_pmix_ext2x_LDFLAGS='-L/share/apps/pmix-2.1.1/gcc7/lib'
opal_pmix_ext2x_LIBS='-lpmix'
opal_pmix_pmix2x_CPPFLAGS=''
opal_pmix_pmix2x_DEPENDENCIES=''
opal_pmix_pmix2x_LDFLAGS=''
opal_pmix_pmix2x_LIBS=''
pmix_alps_CPPFLAGS=''
pmix_alps_LDFLAGS=''
pmix_alps_LIBS=''
pmix_cray_CPPFLAGS=''
pmix_cray_LDFLAGS=''
pmix_cray_LIBS=''
#define OPAL_PMIX_V1 0
Looks pretty promising. The biggest difference I see between this non-working
setup and the working setups is that the working setups have:
#define OPAL_PMIX_V1 1
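To compare that define across build trees without reading each log by hand, something like the following sketch could be used. The function name opal_pmix_v1 and the example paths are hypothetical. It assumes the final value is recorded in config.log as a line beginning with "#define OPAL_PMIX_V1", as in the excerpts here.

```shell
# Sketch: print the value configure recorded for OPAL_PMIX_V1 in a config.log.
opal_pmix_v1() {
    # $1: path to an Open MPI config.log
    # Only lines that start with "#define" are matched; if several match,
    # the last one is reported.
    sed -n 's/^#define OPAL_PMIX_V1 \([0-9][0-9]*\)$/\1/p' "$1" | tail -n 1
}

# Usage (hypothetical build-tree paths):
#   opal_pmix_v1 /path/to/working/openmpi-build/config.log
#   opal_pmix_v1 /path/to/nonworking/openmpi-build/config.log
```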
When I try to run a job with the slurm + openmpi-3.1 built as above, I get:
bill@demon:~/relay$ srun -N 2 -n 2 -t 1 ./relay 1
[c2-33:02763] OPAL ERROR: Not initialized in file ext2x_client.c at line 109
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:
version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.
Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.
Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[c2-33:02763] Local abort before MPI_INIT completed completed successfully, but
am not able to aggregate error messages, and not able to guarantee that all
other processes were killed!
[c2-31:17377] OPAL ERROR: Not initialized in file ext2x_client.c at line 109
Any ideas on how to debug the above? I was trying to use ldd to double-check
which libraries things were linked against, but I couldn't find any pmix
libraries, even on the working clusters.
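One possible explanation for the empty ldd result: both Slurm's MPI support and Open MPI's PMIx support are normally built as plugins loaded at run time, so the libpmix dependency would show up on the plugin shared objects rather than on srun or the application binary. A sketch of how to check follows. The function names are made up, and it assumes the plugin files have "pmix" in their names under the install prefixes given to configure above.

```shell
# Sketch: find shared objects with "pmix" in the name under an install prefix.
pmix_plugins() {
    # $1: install prefix to search
    find "$1" -type f -name '*pmix*.so*' 2>/dev/null | sort
}

# For each file found, show which libpmix (if any) the dynamic linker resolves.
pmix_plugin_deps() {
    pmix_plugins "$1" | while IFS= read -r so; do
        echo "== $so"
        ldd "$so" 2>/dev/null | grep -i pmix || echo "   (no pmix dependency reported)"
    done
}

# Usage, with the prefixes from the configure lines above:
#   pmix_plugin_deps /share/apps/slurm-17.11.5/gcc7
#   pmix_plugin_deps /share/apps/openmpi-3.1.0/gcc7
```

If both sides resolve to the same /share/apps/pmix-2.1.1/gcc7/lib, the builds are at least consistent. It may also be worth running srun --mpi=list to see which MPI plugin types this Slurm build offers and which one srun is using by default.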
It's possible, of course, that it's entirely an openmpi problem. I'll keep
investigating and will post there if I can't find a solution.