[slurm-users] Segfault with 32 processes, OK with 30 ???

Diego Zuccato diego.zuccato at unibo.it
Mon Oct 5 11:05:15 UTC 2020


Hello all.

I'm seeing (again) this weird issue.
The same executable crashes immediately when launched with 32 processes,
while it runs flawlessly with only 30 processes.

The reported error is:
[str957-bl0-03:05271] *** Process received signal ***
[str957-bl0-03:05271] Signal: Segmentation fault (11)
[str957-bl0-03:05271] Signal code: Address not mapped (1)
[str957-bl0-03:05271] Failing at address: 0x7f3826fb4008
[str957-bl0-03:05271] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7f3825df6730]
[str957-bl0-03:05271] [ 1] /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x2936)[0x7f3824553936]
[str957-bl0-03:05271] [ 2] /usr/lib/x86_64-linux-gnu/libmca_common_dstore.so.1(pmix_common_dstor_init+0x9d3)[0x7f382452a733]
[str957-bl0-03:05271] [ 3] /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x25b4)[0x7f38245535b4]
[str957-bl0-03:05271] [ 4] /usr/lib/x86_64-linux-gnu/libpmix.so.2(pmix_gds_base_select+0x12e)[0x7f382467946e]
[str957-bl0-03:05271] [ 5] /usr/lib/x86_64-linux-gnu/libpmix.so.2(pmix_rte_init+0x8cd)[0x7f382463188d]
[str957-bl0-03:05271] [ 6] /usr/lib/x86_64-linux-gnu/libpmix.so.2(PMIx_Init+0xdc)[0x7f38245edd7c]
[str957-bl0-03:05271] [ 7] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pmix_ext2x.so(ext2x_client_init+0xc4)[0x7f38246e9fe4]
[str957-bl0-03:05271] [ 8] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_ess_pmi.so(+0x2656)[0x7f3826fb9656]
[str957-bl0-03:05271] [ 9] /usr/lib/x86_64-linux-gnu/libopen-rte.so.40(orte_init+0x29a)[0x7f3825b8011a]
[str957-bl0-03:05271] [10] /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_init+0x252)[0x7f3825e50e62]
[str957-bl0-03:05271] [11] /usr/lib/x86_64-linux-gnu/libmpi.so.40(MPI_Init+0x6e)[0x7f3825e7f17e]
[str957-bl0-03:05271] [12] ./C-GenIC(+0x23b9)[0x55bf9fa8e3b9]
[str957-bl0-03:05271] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb)[0x7f3825c4709b]
[str957-bl0-03:05271] [14] ./C-GenIC(+0x251a)[0x55bf9fa8e51a]
[str957-bl0-03:05271] *** End of error message ***


In the past, merely installing gdb to try to debug it made the problem
disappear, but obviously that was not a real solution.

Any hints?

TIA

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786


