[slurm-users] Intermittent problem at 32 CPUs

Diego Zuccato diego.zuccato at unibo.it
Sun Jun 7 07:44:50 UTC 2020


On 05/06/20 15:29, Riebs, Andy wrote:

Thanks for the answer.

> I'm *guessing* that you are tripping over the use of "--tasks 32" on a heterogeneous cluster,
If you mean that using "--tasks 32" pulls in a second node, then no: the
job stays on a single node, and that node does have two AMD Opteron 6274
CPUs (16 cores each), so 32 tasks fit on it.

> though your comment about the node without InfiniBand troubles me. If you drain that node, or exclude it in your command line, that might correct the problem. I wonder if OMPI and PMIx have decided that IB is the way to go, and are failing when they try to set up on the node without IB.
The job uses a single node. On another node (identical HW: they're two
servers in the same 1U chassis) the same job works with 32 tasks. Nodes
are configured via a script, so the config should be exactly the same,
but maybe something fell out of sync (continuous updates w/o reinstall
since Debian 8!). Still, I couldn't find anything obviously different.
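For what it's worth, this is roughly the kind of comparison I've been
doing between the two nodes (hostnames here are just placeholders):
-8<--
# Working node and failing node (placeholder names).
GOOD=node-a
BAD=node-b

# Compare installed package versions.
diff <(ssh $GOOD "dpkg -l | awk '{print \$2, \$3}'") \
     <(ssh $BAD  "dpkg -l | awk '{print \$2, \$3}'")

# Compare what slurmd itself detects on each node.
diff <(ssh $GOOD slurmd -C) <(ssh $BAD slurmd -C)

# Compare the Open MPI MCA parameter files.
diff <(ssh $GOOD cat /etc/openmpi/openmpi-mca-params.conf) \
     <(ssh $BAD  cat /etc/openmpi/openmpi-mca-params.conf)
-8<--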

> If that's not it, I'd try
> 0. Check sacct for the node lists for the successful and unsuccessful runs -- a problem node might jump out.
> 1. Running your job with explicit node lists. Again, you may find a problem node this way.
I already ran it with an explicit node list pointing at the problematic
node, precisely to identify and fix the problem rather than work around
it by leaving a node unused...
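Concretely, something like this (job ID and node name are placeholders):
-8<--
# Force the job onto the suspect node.
srun -N1 -n32 --nodelist=node-b ./my_mpi_program

# Afterwards, check where a given job really ran and how it ended.
sacct -j 12345 --format=JobID,NodeList,State,ExitCode
-8<--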

> p.s. If this doesn't fix it, please include the Slurm and OMPI versions, and a copy of your slurm.conf file (with identifying information like node names removed) in your next note to this list.
I'm using Debian-packaged versions:
slurm-client/stable,stable,now 18.08.5.2-1+deb10u1 amd64
openmpi-bin/stable,now 3.1.3-11 amd64

slurm.conf (nodes and partitions omitted):
-8<--
SlurmCtldHost=str957-cluster(#.#.#.#)
AuthType=auth/munge
CacheGroups=0
CryptoType=crypto/munge
EnforcePartLimits=YES
MpiDefault=none
MpiParams=ports=12000-12999
ProctrackType=proctrack/cgroup
PrologSlurmctld=/etc/slurm-llnl/SlurmCtldProlog.sh
ReturnToService=2
SlurmctldPidFile=/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm/slurmctld
SwitchType=switch/none
TaskPlugin=task/cgroup
TmpFS=/mnt/local_data/
UsePAM=1
GetEnvTimeout=20
InactiveLimit=0
KillWait=120
MinJobAge=300
SlurmctldTimeout=20
SlurmdTimeout=30
Waittime=10
FastSchedule=0
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
PriorityType=priority/multifactor
PreemptMode=CANCEL
PreemptType=preempt/partition_prio
AccountingStorageEnforce=safe
AccountingStorageHost=str957-cluster
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
AcctGatherNodeFreq=300
ClusterName=oph
JobCompLoc=/var/spool/slurm/jobscompleted.txt
JobCompType=jobcomp/filetxt
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd.log
-8<--
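If the paste above got mangled, the same settings as the running
controller sees them can be dumped with:
-8<--
scontrol show config
-8<--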

I've had a similar problem while adding new nodes in a new partition. I
"solved" it (probably) by adding the line
mtl = psm2
to /etc/openmpi/openmpi-mca-params.conf.
But those were nodes with IB.
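For the non-IB node I haven't tried the opposite yet; the kind of check
and override I'd experiment with next is something like this (the
program name is a placeholder):
-8<--
# See which MTL/BTL components this Open MPI build ships.
ompi_info | grep -i -E 'mca (mtl|btl)'

# Test run forcing plain TCP + shared memory, bypassing IB/PSM2.
mpirun --mca pml ob1 --mca btl self,vader,tcp -np 32 ./my_mpi_program
-8<--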

Since I'm quite ignorant about the whole MPI and IB ecosystem, it's
mostly guesswork...

-- 
Diego Zuccato
Servizi Informatici
Dip. di Fisica e Astronomia (DIFA) - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
mail: diego.zuccato at unibo.it


