[slurm-users] Segfault with 32 processes, OK with 30 ???

Diego Zuccato diego.zuccato at unibo.it
Thu Oct 8 06:37:51 UTC 2020


On 06/10/20 13:45, Riebs, Andy wrote:

Well, the cluster is quite heterogeneous, and node bl0-02 only has 24
threads available:
str957-bl0-02:~$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              24
On-line CPU(s) list: 0-23
Thread(s) per core:  2
Core(s) per socket:  6
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               45
Model name:          Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
Stepping:            7
CPU MHz:             1943.442
CPU max MHz:         2500,0000
CPU min MHz:         1200,0000
BogoMIPS:            4000.26
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            15360K
NUMA node0 CPU(s):   0-5,12-17
NUMA node1 CPU(s):   6-11,18-23
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good
nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor
ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2
x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti
tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts


str957-bl0-03:~$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               63
Model name:          Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Stepping:            2
CPU MHz:             2400.142
CPU max MHz:         2300,0000
CPU min MHz:         1200,0000
BogoMIPS:            4800.28
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            20480K
NUMA node0 CPU(s):   0-7,16-23
NUMA node1 CPU(s):   8-15,24-31
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good
nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor
ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1
sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm abm
cpuid_fault epb invpcid_single pti intel_ppin tpr_shadow vnmi
flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2
erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm arat pln pts

Another couple of nodes do have 32 threads, but with AMD CPUs...

The same problem happened in the past, and it seemed to "move" between
nodes even with no changes in the config. While trying to fix it I added
mtl = psm2
to /etc/openmpi/openmpi-mca-params.conf, but only installing gdb and its
dependencies apparently "worked". As I feared, though, that was just a
mask, not a solution.
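
In case it helps, that is literally the only line I changed in that
file; below is also a sketch of how one could check which MTL Open MPI
actually selects at run time (the verbose level and the ./a.out test
binary are just placeholders):
-8<--
# /etc/openmpi/openmpi-mca-params.conf
mtl = psm2

$ ompi_info | grep -i mtl                          # MTL components available
$ mpirun --mca mtl_base_verbose 10 -np 2 ./a.out   # see which MTL is selected
-8<--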

>> The problem is with a single, specific node: str957-bl0-03. The same
>> job script works when allocated to another node, even with more ranks
>> (tested up to 224 ranks across 4 mtx-* nodes).
> 
> Ahhh... here's where the details help. So it appears that the problem is on a single node, and probably not a general configuration or system problem. I suggest starting with something like this to help figure out why node bl0-03 is different:
> 
> $ sudo ssh str957-bl0-02 lscpu
> $ sudo ssh str957-bl0-03 lscpu
> 
> Andy
> 
> -----Original Message-----
> From: Diego Zuccato [mailto:diego.zuccato at unibo.it] 
> Sent: Tuesday, October 6, 2020 3:13 AM
> To: Riebs, Andy <andy.riebs at hpe.com>; Slurm User Community List <slurm-users at lists.schedmd.com>
> Subject: Re: [slurm-users] Segfault with 32 processes, OK with 30 ???
> 
> On 05/10/20 14:18, Riebs, Andy wrote:
> 
> Thanks for considering my query.
> 
>> You need to provide some hints! What we know so far:
>> 1. What we see here is a backtrace from (what looks like) an Open MPI/PMI-x backtrace.
> Correct.
> 
>> 2. Your decision to address this to the Slurm mailing list suggests that you think that Slurm might be involved.
> At least I couldn't replicate it when launching manually (mpirun always
> says "no slots available" unless I use mpirun -np 16 ...). I'm no MPI
> expert (actually less than a noob!), so I can't rule out that it's
> unrelated to Slurm. I mostly hope that on this list I can find someone
> with enough experience with both Slurm and MPI.
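> 
> Just for reference, a sketch of the manual launches I tried on that
> node (./mpitest is a placeholder name, and I'm not sure these flags
> map to what Slurm does internally):
> -8<--
> $ mpirun -np 16 ./mpitest                       # works; mpirun seems to count
>                                                 # the 16 physical cores as slots
> $ mpirun --use-hwthread-cpus -np 32 ./mpitest   # count hardware threads as slots
> $ mpirun --oversubscribe -np 32 ./mpitest       # or simply allow oversubscription
> -8<--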
> 
>> 3. You have something (a job? a program?) that segfaults when you go from 30 to 32 processes.
> Multiple programs, actually.
> 
>> a. What operating system?
> Debian 10.5. The only extension is PBIS-open, to authenticate users from AD.
> 
>> b. Are you seeing this while running Slurm? What version?
> 18.04, Debian packages
> 
>> c. What version of Open MPI?
> openmpi-bin/stable,now 3.1.3-11 amd64
> 
>> d. Are you building your own PMI-x, or are you using what's provided by Open MPI and Slurm?
> Using Debian packages
> 
>> e. What does your hardware configuration look like -- particularly, what cpu type(s), and how many cores/node?
> The node uses dual Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz for a total
> of 32 threads (hyperthreading is enabled: 2 sockets, 8 cores per socket,
> 2 threads per core).
> 
>> f. What does your Slurm configuration look like (assuming you're seeing this with Slurm)? I suggest purging your configuration files of node names and IP addresses, and including them with your query.
> Here it is:
> -8<--
> SlurmCtldHost=str957-cluster(*.*.*.*)
> AuthType=auth/munge
> CacheGroups=0
> CryptoType=crypto/munge
> #DisableRootJobs=NO
> EnforcePartLimits=YES
> JobSubmitPlugins=lua
> MpiDefault=none
> MpiParams=ports=12000-12999
> ReturnToService=2
> SlurmctldPidFile=/run/slurmctld.pid
> SlurmctldPort=6817
> SlurmdPidFile=/run/slurmd.pid
> SlurmdPort=6818
> SlurmdSpoolDir=/var/lib/slurm/slurmd
> SlurmUser=slurm
> StateSaveLocation=/var/lib/slurm/slurmctld
> SwitchType=switch/none
> TaskPlugin=task/cgroup
> TmpFS=/mnt/local_data/
> UsePAM=1
> GetEnvTimeout=20
> InactiveLimit=0
> KillWait=120
> MinJobAge=300
> SlurmctldTimeout=20
> SlurmdTimeout=30
> FastSchedule=0
> SchedulerType=sched/backfill
> SchedulerPort=7321
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core_Memory
> PriorityFlags=MAX_TRES
> PriorityType=priority/multifactor
> PreemptMode=CANCEL
> PreemptType=preempt/partition_prio
> AccountingStorageEnforce=safe,qos
> AccountingStorageHost=str957-cluster
> #AccountingStorageLoc=
> #AccountingStoragePass=
> #AccountingStoragePort=6819
> #AccountingStorageTRES=
> AccountingStorageType=accounting_storage/slurmdbd
> #AccountingStorageUser=
> AccountingStoreJobComment=YES
> AcctGatherNodeFreq=300
> ClusterName=oph
> JobCompLoc=/var/spool/slurm/jobscompleted.txt
> JobCompType=jobcomp/filetxt
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/linux
> SlurmctldDebug=3
> SlurmctldLogFile=/var/log/slurm/slurmctld.log
> SlurmdDebug=3
> SlurmdLogFile=/var/log/slurm/slurmd.log
> NodeName=DEFAULT Sockets=2 ThreadsPerCore=2 State=UNKNOWN
> NodeName=str957-bl0-0[1-2] CoresPerSocket=6 Feature=ib,blade,intel
> NodeName=str957-bl0-0[3-5] CoresPerSocket=8 Feature=ib,blade,intel
> NodeName=str957-bl0-[15-16] CoresPerSocket=4 Feature=ib,nonblade,intel
> NodeName=str957-bl0-[17-18] CoresPerSocket=6 ThreadsPerCore=1 Feature=nonblade,amd
> NodeName=str957-bl0-[19-20] Sockets=4 CoresPerSocket=8 ThreadsPerCore=1 Feature=nonblade,amd
> NodeName=str957-mtx-[00-15] CoresPerSocket=14 Feature=ib,nonblade,intel
> -8<--
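> 
> (For cross-checking against the lscpu output above, something like
> this should show what slurmd actually registered for the two nodes;
> just a sketch:)
> -8<--
> $ scontrol show node str957-bl0-02 | grep -i -e CPUTot -e ThreadsPerCore
> $ scontrol show node str957-bl0-03 | grep -i -e CPUTot -e ThreadsPerCore
> -8<--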
> 
>> g. What does your command line look like? Especially, are you trying to run 32 processes on a single node? Spreading them out across 2 or more nodes?
> The problem is with a single, specific node: str957-bl0-03. The same
> job script works when allocated to another node, even with more ranks
> (tested up to 224 ranks across 4 mtx-* nodes).
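> 
> Roughly, this is how I pin the same job to one node or another (a
> sketch; job.sh stands for the actual job script):
> -8<--
> $ sbatch -N1 -n32 -w str957-bl0-03 job.sh   # segfaults
> $ sbatch -N1 -n30 -w str957-bl0-03 job.sh   # works
> $ sbatch -N1 -n32 -w str957-bl0-04 job.sh   # works on a sibling node
> -8<--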
> 
>> h. Can you reproduce the problem if you substitute `hostname` or `true` for the program in the command line? What about a simple MPI-enabled "hello world?"
> I'll try ASAP w/ a simple 'hostname'. But I expect it to work.
> The original problem is with a complex program run by a user. To try to
> debug the issue I'm using what I think is the simplest MPI program possible:
> -8<--
> #include "mpi.h"
> #include <stdio.h>
> #include <stdlib.h>
> #define  MASTER         0
> 
> int main (int argc, char *argv[])
> {
>   int   numtasks, taskid, len;
>   char hostname[MPI_MAX_PROCESSOR_NAME];
>   MPI_Init(&argc, &argv);
> //  int provided=0;
> //  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
> //printf("MPI provided threads: %d\n", provided);
>   MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
>   MPI_Comm_rank(MPI_COMM_WORLD,&taskid);
> 
>   if (taskid == MASTER)
>     printf("This is an MPI parallel code for Hello World with no communication\n");
>   //MPI_Barrier(MPI_COMM_WORLD);
> 
> 
>   MPI_Get_processor_name(hostname, &len);
> 
>   printf ("Hello from task %d on %s!\n", taskid, hostname);
> 
>   if (taskid == MASTER)
>     printf("MASTER: Number of MPI tasks is: %d\n",numtasks);
> 
>   MPI_Finalize();
> 
>   printf("END OF CODE from task %d\n", taskid);
> }
> -8<--
> And I got failures with it, too.
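> 
> For the record, a sketch of how I build and run it (mpitest.c is a
> placeholder name, and I'm not 100% sure --mpi=pmix matches what the
> user's original job uses):
> -8<--
> $ mpicc -o mpitest mpitest.c
> $ srun -N1 -n30 -w str957-bl0-03 --mpi=pmix ./mpitest   # OK
> $ srun -N1 -n32 -w str957-bl0-03 --mpi=pmix ./mpitest   # segfaults
> -8<--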
> 


-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786


