[slurm-users] Intermittent problem at 32 CPUs

Riebs, Andy andy.riebs at hpe.com
Fri Jun 5 13:29:35 UTC 2020


Diego,

I'm *guessing* that you are tripping over the use of "--ntasks=32" on a heterogeneous cluster, though your comment about the node without InfiniBand troubles me. If you drain that node, or exclude it on your command line, that might correct the problem. I wonder if OMPI and PMIx have decided that IB is the way to go, and are failing when they try to set it up on the node without IB.
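For example (the node name is a placeholder, adjust for your site):

  # Drain the suspect node so no new jobs land on it
  scontrol update NodeName=nodeXX State=DRAIN Reason="debugging PMIx segfault"
  # ...or leave it up and just keep this job off it
  sbatch --exclude=nodeXX job.sh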

If that's not it, I'd try:
0. Checking sacct for the node lists of the successful and unsuccessful runs -- a problem node might jump out.
1. Running your job with explicit node lists; again, you may find a problem node this way.
(Example commands below.)
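Something along these lines (job IDs and node names are placeholders):

  # Compare where the good and bad runs landed
  sacct -j <good_jobid>,<bad_jobid> --format=JobID,State,ExitCode,NodeList
  # Pin the job to an explicit set of nodes
  sbatch --nodelist=node01,node02 job.sh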

HTH!
Andy

p.s. If this doesn't fix it, please include the Slurm and OMPI versions, and a copy of your slurm.conf file (with identifying information like node names removed) in your next note to this list.
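Something like this should capture the version info on a Debian 10 install (adjust the package name patterns as needed):

  sinfo --version                            # Slurm version
  ompi_info | head                           # Open MPI version and build info
  dpkg -l | grep -E 'slurm|openmpi|pmix'     # Debian package versions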

-----Original Message-----
From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of Diego Zuccato
Sent: Friday, June 5, 2020 9:08 AM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: [slurm-users] Intermittent problem at 32 CPUs

Hello all.

I've been trying to debug this problem for some weeks, but it seems I'm
still missing something.
I have a small, (very) heterogeneous cluster. After upgrading to Debian
10 and the packaged versions of Slurm and the IB drivers/tools, I noticed
that *sometimes* jobs requesting 32 or more threads fail with an error like:
-8<--
[str957-bl0-19:30411] *** Process received signal ***
[str957-bl0-19:30411] Signal: Segmentation fault (11)
[str957-bl0-19:30411] Signal code: Address not mapped (1)
[str957-bl0-19:30411] Failing at address: 0x7fb206380008
[str957-bl0-19:30411] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x37840)[0x7fb205eb7840]
[str957-bl0-19:30411] [ 1] /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x2936)[0x7fb200ac2936]
[str957-bl0-19:30411] [ 2] /usr/lib/x86_64-linux-gnu/libmca_common_dstore.so.1(pmix_common_dstor_init+0x9d3)[0x7fb200a92733]
[str957-bl0-19:30411] [ 3] /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x25b4)[0x7fb200ac25b4]
[str957-bl0-19:30411] [ 4] /usr/lib/x86_64-linux-gnu/libpmix.so.2(pmix_gds_base_select+0x12e)[0x7fb200bba46e]
[str957-bl0-19:30411] [ 5] /usr/lib/x86_64-linux-gnu/libpmix.so.2(pmix_rte_init+0x8cd)[0x7fb200b7288d]
[str957-bl0-19:30411] [ 6] /usr/lib/x86_64-linux-gnu/libpmix.so.2(PMIx_Init+0xdc)[0x7fb200b2ed7c]
[str957-bl0-19:30411] [ 7] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pmix_ext2x.so(ext2x_client_init+0xc4)[0x7fb200c35fe4]
[str957-bl0-19:30411] [ 8] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_ess_pmi.so(+0x2656)[0x7fb201462656]
[str957-bl0-19:30411] [ 9] /usr/lib/x86_64-linux-gnu/libopen-rte.so.40(orte_init+0x29a)[0x7fb202a9211a]
[str957-bl0-19:30411] [10] /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_init+0x252)[0x7fb203f23e62]
[str957-bl0-19:30411] [11] /usr/lib/x86_64-linux-gnu/libmpi.so.40(PMPI_Init_thread+0x55)[0x7fb203f522d5]
-8<--
Just changing --ntasks=32 to --ntasks=30 (or fewer) lets it run without problems.
*Sometimes* it even works with --ntasks=32.
But the most absurd thing I've seen is this (just changing the step in
the batch job):
-8<--
mpirun ./mpitest => KO
gdb -batch -n -ex 'set pagination off' -ex run -ex bt -ex 'bt full' -ex 'thread apply all bt full' --args mpirun --mca btl openib --mca mtl psm2 ./mpitest-debug => OK
mpirun --mca btl openib --mca mtl psm2 ./mpitest-debug => OK
mpirun --mca mtl psm2 ./mpitest-debug => OK
mpirun ./mpitest-debug => OK
mpirun ./mpitest => OK?!?!?!?!
-8<--
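(For reference, the batch script is essentially this -- a minimal sketch, not my real script; job name and binary path are placeholders:)
-8<--
#!/bin/bash
#SBATCH --job-name=mpitest
#SBATCH --ntasks=32   # intermittently fails; with --ntasks=30 it always runs
mpirun ./mpitest
-8<--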

In the end, *the same* command that had consistently failed started to run.
The currently problematic node is one without InfiniBand, so IB can
probably be ruled out.

Any hints?

TIA.

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
