[slurm-users] Intermittent problem at 32 CPUs
Diego Zuccato
diego.zuccato at unibo.it
Fri Jun 5 13:07:59 UTC 2020
Hello all.
I've been trying to debug this problem for some weeks now, but it seems
I'm still missing something.
I have a small, (very) heterogeneous cluster. After upgrading to Debian
10 and the packaged versions of Slurm and the IB drivers/tools, I noticed
that *sometimes* jobs requesting 32 or more tasks fail with an error like:
-8<--
[str957-bl0-19:30411] *** Process received signal ***
[str957-bl0-19:30411] Signal: Segmentation fault (11)
[str957-bl0-19:30411] Signal code: Address not mapped (1)
[str957-bl0-19:30411] Failing at address: 0x7fb206380008
[str957-bl0-19:30411] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x37840)[0x7fb205eb7840]
[str957-bl0-19:30411] [ 1] /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x2936)[0x7fb200ac2936]
[str957-bl0-19:30411] [ 2] /usr/lib/x86_64-linux-gnu/libmca_common_dstore.so.1(pmix_common_dstor_init+0x9d3)[0x7fb200a92733]
[str957-bl0-19:30411] [ 3] /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x25b4)[0x7fb200ac25b4]
[str957-bl0-19:30411] [ 4] /usr/lib/x86_64-linux-gnu/libpmix.so.2(pmix_gds_base_select+0x12e)[0x7fb200bba46e]
[str957-bl0-19:30411] [ 5] /usr/lib/x86_64-linux-gnu/libpmix.so.2(pmix_rte_init+0x8cd)[0x7fb200b7288d]
[str957-bl0-19:30411] [ 6] /usr/lib/x86_64-linux-gnu/libpmix.so.2(PMIx_Init+0xdc)[0x7fb200b2ed7c]
[str957-bl0-19:30411] [ 7] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pmix_ext2x.so(ext2x_client_init+0xc4)[0x7fb200c35fe4]
[str957-bl0-19:30411] [ 8] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_ess_pmi.so(+0x2656)[0x7fb201462656]
[str957-bl0-19:30411] [ 9] /usr/lib/x86_64-linux-gnu/libopen-rte.so.40(orte_init+0x29a)[0x7fb202a9211a]
[str957-bl0-19:30411] [10] /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_init+0x252)[0x7fb203f23e62]
[str957-bl0-19:30411] [11] /usr/lib/x86_64-linux-gnu/libmpi.so.40(PMPI_Init_thread+0x55)[0x7fb203f522d5]
-8<--
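FWIW, the faulting frames are in PMIx's ds21 gds component. If it helps
anyone look further, the raw module offsets in the trace can be resolved
to source lines with addr2line (a sketch; assumes matching dbgsym debug
symbols for the pmix packages are installed):
-8<--
# resolve the module offsets from frames [1] and [3] above;
# without the pmix dbgsym packages this prints ??:0
addr2line -f -e /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so 0x2936 0x25b4
-8<--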
Just changing --ntasks=32 to --ntasks=30 (or fewer) lets it run w/o problems.
*Sometimes* it works even with --ntasks=32.
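For reference, the batch script is nothing exotic; a minimal sketch of
its shape (site-specific options like partition and time limit omitted):
-8<--
#!/bin/bash
#SBATCH --ntasks=32       # intermittent segfault; 30 or fewer is fine
#SBATCH --job-name=mpitest
mpirun ./mpitest
-8<--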
But the most absurd thing I've seen is this (changing only the job step
in the batch script):
-8<--
mpirun ./mpitest => KO
gdb -batch -n -ex 'set pagination off' -ex run -ex bt -ex 'bt full' \
    -ex 'thread apply all bt full' \
    --args mpirun --mca btl openib --mca mtl psm2 ./mpitest-debug => OK
mpirun --mca btl openib --mca mtl psm2 ./mpitest-debug => OK
mpirun --mca mtl psm2 ./mpitest-debug => OK
mpirun ./mpitest-debug => OK
mpirun ./mpitest => OK?!?!?!?!
-8<--
In the end, *the same* command that had consistently failed started to run.
The currently problematic node is one w/o InfiniBand, so IB itself can
probably be ruled out.
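One thing I can still try is capturing a core dump instead of running
under gdb, since gdb seems to perturb the problem; a sketch (the core
file name depends on kernel.core_pattern):
-8<--
# in the batch script: allow core dumps, then inspect offline
ulimit -c unlimited
mpirun ./mpitest
# after a crash:
gdb ./mpitest core
-8<--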
Any hints?
TIA.
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786