[slurm-users] Segfault with 32 processes, OK with 30 ???

Riebs, Andy andy.riebs at hpe.com
Mon Oct 5 12:18:26 UTC 2020


You need to provide some hints! What we know so far:

1. What we see here is (what looks like) an Open MPI/PMIx backtrace.
2. Your decision to address this to the Slurm mailing list suggests that you think that Slurm might be involved.
3. You have something (a job? a program?) that segfaults when you go from 30 to 32 processes.

At a minimum, it would help your readers' understanding, and ability to help, to know:

a. What operating system?
b. Are you seeing this while running Slurm? What version?
c. What version of Open MPI?
d. Are you building your own PMIx, or are you using what's provided by Open MPI and Slurm?
e. What does your hardware configuration look like -- particularly, what cpu type(s), and how many cores/node?
f. What does your Slurm configuration look like (assuming you're seeing this with Slurm)? I suggest purging your configuration files of node names and IP addresses and including them with your query.
g. What does your command line look like? Especially, are you trying to run 32 processes on a single node? Spreading them out across 2 or more nodes?
h. Can you reproduce the problem if you substitute `hostname` or `true` for the program in the command line? What about a simple MPI-enabled "hello world" (see the sketch below)?
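
For item (h), a minimal MPI-enabled "hello world" might look like the sketch below (the file name, compiler wrapper, and srun flags are just examples; adjust to your site):

    /* mpi_hello.c -- trivial MPI test program */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char node[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(node, &len);
        printf("Hello from rank %d of %d on %s\n", rank, size, node);
        MPI_Finalize();
        return 0;
    }

Built and launched roughly like this (assuming an Open MPI mpicc and a Slurm built with PMIx support; `srun --mpi=list` shows which plugins your installation actually provides):

    mpicc -o mpi_hello mpi_hello.c
    srun -N 1 -n 32 --mpi=pmix ./mpi_hello    # 32 ranks on one node
    srun -N 2 -n 32 --mpi=pmix ./mpi_hello    # 32 ranks across two nodes

If this also segfaults at 32 ranks, the problem most likely lies in the MPI/PMIx/Slurm stack rather than in your application.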

Andy

-----Original Message-----
From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of Diego Zuccato
Sent: Monday, October 5, 2020 7:05 AM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: [slurm-users] Segfault with 32 processes, OK with 30 ???

Hello all.

I'm seeing (again) this weird issue.
The same executable, launched with 32 processes, crashes immediately,
while it runs flawlessly with only 30 processes.

The reported error is:
[str957-bl0-03:05271] *** Process received signal ***
[str957-bl0-03:05271] Signal: Segmentation fault (11)
[str957-bl0-03:05271] Signal code: Address not mapped (1)
[str957-bl0-03:05271] Failing at address: 0x7f3826fb4008
[str957-bl0-03:05271] [ 0]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7f3825df6730]
[str957-bl0-03:05271] [ 1]
/usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x2936)[0x7f3824553936]
[str957-bl0-03:05271] [ 2]
/usr/lib/x86_64-linux-gnu/libmca_common_dstore.so.1(pmix_common_dstor_init+0x9d3)[0x7f382452a733]
[str957-bl0-03:05271] [ 3]
/usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x25b4)[0x7f38245535b4]
[str957-bl0-03:05271] [ 4]
/usr/lib/x86_64-linux-gnu/libpmix.so.2(pmix_gds_base_select+0x12e)[0x7f382467946e]
[str957-bl0-03:05271] [ 5]
/usr/lib/x86_64-linux-gnu/libpmix.so.2(pmix_rte_init+0x8cd)[0x7f382463188d]
[str957-bl0-03:05271] [ 6]
/usr/lib/x86_64-linux-gnu/libpmix.so.2(PMIx_Init+0xdc)[0x7f38245edd7c]
[str957-bl0-03:05271] [ 7]
/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pmix_ext2x.so(ext2x_client_init+0xc4)[0x7f38246e9fe4]
[str957-bl0-03:05271] [ 8]
/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_ess_pmi.so(+0x2656)[0x7f3826fb9656]
[str957-bl0-03:05271] [ 9]
/usr/lib/x86_64-linux-gnu/libopen-rte.so.40(orte_init+0x29a)[0x7f3825b8011a]
[str957-bl0-03:05271] [10]
/usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_init+0x252)[0x7f3825e50e62]
[str957-bl0-03:05271] [11]
/usr/lib/x86_64-linux-gnu/libmpi.so.40(MPI_Init+0x6e)[0x7f3825e7f17e]
[str957-bl0-03:05271] [12] ./C-GenIC(+0x23b9)[0x55bf9fa8e3b9]
[str957-bl0-03:05271] [13]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb)[0x7f3825c4709b]
[str957-bl0-03:05271] [14] ./C-GenIC(+0x251a)[0x55bf9fa8e51a]
[str957-bl0-03:05271] *** End of error message ***


In the past, just installing gdb to try to debug it made the problem
disappear; obviously that was not a solution...

Any hint?

TIA

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
