[slurm-users] mpich and sbatch
stephen tjemkes
stephen.tjemkes at hotmail.com
Mon Dec 5 16:33:27 UTC 2022
Hi,
Maybe this has been asked several times before, and a solution may be
readily available.
Facility topology: 7 identical machines (one host, 6 clients), each with
16 GB RAM and 8 cores. These are virtual machines; the bare-metal host
configuration is unfortunately not known to me.
The slurm.conf is listed below.
Use case: I have 30 different scripts; each script runs an application
in a separate partition of a shared disk. All input files are copied
from a repository into the specific partition, and the application is
started with mpirun -n 8 executable.
If I submit a single instance of the ksh script to the Slurm batch
system (sbatch ksh-script), the node happily runs all 8 cores at 100%
user time.
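For reference, a minimal sketch of what one of these submission scripts looks like (the job name, paths, and executable name here are placeholders, not the actual layout):

```shell
#!/bin/ksh
#SBATCH --job-name=case01        # hypothetical case name
#SBATCH --nodes=1                # run the whole job on a single node
#SBATCH --ntasks=8               # one MPI rank per core

# Copy the input files from the repository into this case's
# partition of the shared disk (placeholder paths).
cp /repository/case01/* /shared/case01/

cd /shared/case01
mpirun -n 8 ./executable
```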
However, if I submit more than, say, 4 instances, the jobs are
distributed over the various nodes, and nmon or htop still shows each of
the 8 cores at 100% busy, but the CPU time breaks down as roughly 25%
user and 75% steal.
My question: is this the result of a Slurm setting (and if so, which
setting should I add to the conf file), or is this an issue with the
configuration of the bare-metal machine?
Many thanks,
stephen
========================================================================
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=cluster
SlurmctldHost=***
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm-wlm/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm-wlm/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm-wlm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm-wlm/slurmd.log
#
# COMPUTE NODES
NodeName=*** NodeAddr=** CPUs=1 RealMemory=16000 State=UNKNOWN
PartitionName=w4repp Nodes=ALL Default=YES MaxTime=INFINITE State=UP
#PartitionName=LocalQ Nodes=ALL Default=YES MaxTime=INFINITE State=UP