[slurm-users] Segfault with 32 processes, OK with 30 ???

Diego Zuccato diego.zuccato at unibo.it
Mon Oct 12 09:43:36 UTC 2020


On 08/10/20 08:48, Chris Samuel wrote:

Sorry for being so late. I've had to wait for the node to be free.

> Launch it with "srun" rather than "mpirun", that way it'll be managed by 
> Slurm.  If your test program then says every rank is rank 0 that will tell you 
> OpenMPI is not built with Slurm support.
Seems so:
"The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute."

So it seems I can't use srun to launch OpenMPI jobs.
But just substituting srun with mpirun (which, IIUC, should be supported),
it seems to work, and it even auto-detects the correct number of ranks to use.
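A quick way to cross-check this (a sketch; both commands are standard, but they need a cluster node to say anything useful, hence the guards):

```shell
# Which PMI flavors can srun offer, and which did this OpenMPI build get?
# The guards make the sketch a no-op on machines without the tools.
if command -v srun >/dev/null 2>&1; then
  srun --mpi=list                      # e.g.: pmi2, pmix, none
else
  echo "srun: not found"
fi
if command -v ompi_info >/dev/null 2>&1; then
  ompi_info | grep -i -E 'slurm|pmi' || echo "no slurm/pmi components listed"
else
  echo "ompi_info: not found"
fi
```

If `srun --mpi=list` and the `ompi_info` output have no PMI flavor in common, direct launch with srun cannot work.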
I launched the test executable with mpirun on one of the newer nodes (56
threads) and got:
-8<--
[...]
Hello from task 52 on str957-mtx-11!
Hello from task 53 on str957-mtx-11!
Hello from task 54 on str957-mtx-11!
This is an MPI parallel code for Hello World with no communication
Hello from task 0 on str957-mtx-11!
MASTER: Number of MPI tasks is: 56
Hello from task 18 on str957-mtx-11!
[...]
-8<--
But if I run it on the older 32-thread node:
-8<--
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff480b700 (LWP 19633)]
[New Thread 0x7ffff3fe9700 (LWP 19634)]
[New Thread 0x7ffff3764700 (LWP 19635)]
[New Thread 0x7ffff2f63700 (LWP 19636)]
[Detaching after fork from child process 19637]
[... 31 more "Detaching after fork" lines, child processes 19638-19697 ...]
[str957-bl0-03:19637] *** Process received signal ***
[str957-bl0-03:19637] Signal: Segmentation fault (11)
[str957-bl0-03:19637] Signal code: Address not mapped (1)
[str957-bl0-03:19637] Failing at address: 0x7ffff7fac008
[str957-bl0-03:19637] [ 0]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7ffff7e92730]
[str957-bl0-03:19637] [ 1]
/usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x2936)[0x7ffff646d936]
[str957-bl0-03:19637] [ 2]
/usr/lib/x86_64-linux-gnu/libmca_common_dstore.so.1(pmix_common_dstor_init+0x9d3)[0x7ffff6444733]
[str957-bl0-03:19637] [ 3]
/usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x25b4)[0x7ffff646d5b4]
[str957-bl0-03:19637] [ 4]
/usr/lib/x86_64-linux-gnu/libpmix.so.2(pmix_gds_base_select+0x12e)[0x7ffff659346e]
[str957-bl0-03:19637] [ 5]
/usr/lib/x86_64-linux-gnu/libpmix.so.2(pmix_rte_init+0x8cd)[0x7ffff654b88d]
[str957-bl0-03:19637] [ 6]
/usr/lib/x86_64-linux-gnu/libpmix.so.2(PMIx_Init+0xdc)[0x7ffff6507d7c]
[str957-bl0-03:19637] [ 7]
/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pmix_ext2x.so(ext2x_client_init+0xc4)[0x7ffff6603fe4]
[str957-bl0-03:19637] [ 8]
/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_ess_pmi.so(+0x2656)[0x7ffff7fb1656]
[str957-bl0-03:19637] [ 9]
/usr/lib/x86_64-linux-gnu/libopen-rte.so.40(orte_init+0x29a)[0x7ffff7c1c11a]
[str957-bl0-03:19637] [10]
/usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_init+0x252)[0x7ffff7eece62]
[str957-bl0-03:19637] [11]
/usr/lib/x86_64-linux-gnu/libmpi.so.40(MPI_Init+0x6e)[0x7ffff7f1b17e]
[str957-bl0-03:19637] [12] ./mpitest-debug(+0x11c6)[0x5555555551c6]
[str957-bl0-03:19637] [13]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb)[0x7ffff7ce309b]
[str957-bl0-03:19637] [14] ./mpitest-debug(+0x10da)[0x5555555550da]
[str957-bl0-03:19637] *** End of error message ***
[... repeats the same error another 29 times ...]
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
[... repeats the error another 2 times ...]
[Thread 0x7ffff480b700 (LWP 19633) exited]
[Thread 0x7ffff3fe9700 (LWP 19634) exited]
[Thread 0x7ffff2f63700 (LWP 19636) exited]
[Thread 0x7ffff3764700 (LWP 19635) exited]
[Inferior 1 (process 19626) exited with code 0213]
No stack.
No stack.
-8<--
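One thing I'll try, since the backtrace dies inside PMIx's ds21 shared-memory component (mca_gds_ds21.so): selecting the plain "hash" gds component through PMIx's MCA environment variable, so the ds21 code path is skipped entirely. Just a guess based on the component named in the trace, not a verified fix:

```shell
# Guess based on the trace (not a verified fix): skip PMIx's ds21
# shared-memory store by forcing the plain hash component.
export PMIX_MCA_gds=hash
# Guarded so the sketch is a no-op where mpirun or the binary is missing:
if command -v mpirun >/dev/null 2>&1 && [ -x ./mpitest-debug ]; then
  mpirun ./mpitest-debug
fi
```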

Some of the extra messages are from gdb. The job step line in the script is:
gdb -batch -n -ex 'set pagination off' -ex run -ex bt -ex 'bt full' \
    -ex 'thread apply all bt full' --args srun ./mpitest-debug

The code is compiled w/ debug support.
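One caveat about that command line: gdb is attached to srun itself, not to the MPI ranks (those are the forked children it detaches from, which is also why the final bt finds no stack). A per-rank variant, sketched on the assumption that mpirun still launches correctly on this node:

```shell
# One batch-mode gdb per rank, so the backtrace comes from the crashing MPI
# process itself rather than from the launcher. Guarded no-op off-cluster.
if command -v mpirun >/dev/null 2>&1 && command -v gdb >/dev/null 2>&1 \
   && [ -x ./mpitest-debug ]; then
  mpirun -np 32 bash -c \
    'exec gdb -batch -ex run -ex bt -ex "bt full" --args ./mpitest-debug'
fi
```

The per-rank output interleaves, but each trace then comes from a real rank instead of from srun.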

I'm quite lost...

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786


