[slurm-users] mpich and sbatch
stephen tjemkes
stephen.tjemkes at hotmail.com
Mon Dec 5 16:33:27 UTC 2022
Hi,
Maybe this has been asked several times before, and a solution may be
readily available.
Facility topology: 7 identical machines (one host, 6 clients), each with
16 GB RAM and 8 cores. These are virtual machines; the bare-metal host
configuration is unfortunately not known to me.
The slurm.conf is listed below.
Use case: I have 30 different scripts; each script runs an application
in a separate partition of a shared disk. All input files are copied
from a repository into the specific partition, and the application is
started with mpirun -n 8 executable.
If I submit a single instance of the ksh script to the Slurm batch
system (sbatch ksh-script), the node happily runs all 8 cores at 100%
user time.
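For reference, a minimal sketch of what one of these submission scripts looks like (the job name, paths, and executable name here are placeholders, not the actual layout):

```shell
#!/bin/ksh
#SBATCH --job-name=case01        # hypothetical case name
#SBATCH --nodes=1                # run the whole job on a single node
#SBATCH --ntasks=8               # one MPI rank per core

# Copy the input files from the repository into this case's
# partition of the shared disk (placeholder paths).
cp /repository/case01/* /shared/case01/

cd /shared/case01
mpirun -n 8 ./executable
```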
However, if I submit more than, say, 4 instances, the jobs are
distributed over the various nodes, and nmon or htop still shows each of
the 8 cores at 100% busy, but the CPU time breaks down as roughly 25%
user and 75% steal.
My question: is this the result of a Slurm setting (and if so, which
setting should I add to the conf file), or is this an issue with the
configuration of the bare-metal machine?
Many thanks,
stephen
========================================================================
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=cluster
SlurmctldHost=***
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm-wlm/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm-wlm/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm-wlm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm-wlm/slurmd.log
#
# COMPUTE NODES
NodeName=*** NodeAddr=** CPUs=1 RealMemory=16000 State=UNKNOWN
PartitionName=w4repp Nodes=ALL Default=YES MaxTime=INFINITE State=UP
#PartitionName=LocalQ Nodes=ALL Default=YES MaxTime=INFINITE State=UP