[slurm-users] SLURM launching jobs onto nodes with suspended jobs may lead to resource contention

SJTU weijianwen at sjtu.edu.cn
Wed Sep 16 10:08:34 UTC 2020


Hi,

I am using SLURM 19.05 and found that SLURM may launch new jobs onto nodes that already hold suspended jobs, which leads to resource contention once the suspended jobs are resumed. Steps to reproduce the issue:

1. Launch 40 one-core jobs on a 40-core compute node. 
2. Suspend all 40 jobs on that compute node with `scontrol suspend JOBID`.

Expected result: No further jobs should be launched onto the compute node, since there are already 40 suspended jobs holding its 40 cores.

Actual result: SLURM launches new jobs onto that compute node, which leads to resource contention once the previously suspended jobs are resumed via `scontrol resume` while the new jobs are still running.
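
For reference, a minimal sketch of how I reproduce it (the node name cas001 and the sleep payload are just placeholders):

    # fill one 40-core node with 40 one-core jobs
    for i in $(seq 1 40); do
        sbatch -n 1 -w cas001 --wrap="sleep 3600"
    done

    # suspend every running job on that node
    for jobid in $(squeue -h -w cas001 -t R -o %i); do
        scontrol suspend $jobid
    done

    # submit another one-core job; in my case it still gets
    # scheduled onto cas001 despite the 40 suspended jobs there
    sbatch -n 1 -w cas001 --wrap="sleep 3600"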
 
Any suggestions are appreciated. The relevant part of slurm.conf is attached below.

Thank you!


Jianwen




AccountingStorageEnforce = associations,limits,qos,safe
AccountingStorageType = accounting_storage/slurmdbd
AuthType = auth/munge
BackupController = slurm2
CacheGroups = 0
ClusterName = mycluster
ControlMachine = slurm1
EnforcePartLimits = true
Epilog = /etc/slurm/slurm.epilog
FastSchedule = 1
GresTypes = gpu
HealthCheckInterval = 300
HealthCheckProgram = /usr/sbin/nhc
InactiveLimit = 0
JobAcctGatherFrequency = 30
JobAcctGatherType = jobacct_gather/cgroup
JobCompType = jobcomp/none
JobRequeue = 0
JobSubmitPlugins = lua
KillOnBadExit = 1
KillWait = 30
MailProg = /opt/slurm-mail/bin/slurm-spool-mail.py
MaxArraySize = 8196
MaxJobCount = 100000
MessageTimeout = 30
MinJobAge = 300
MpiDefault = none
PriorityDecayHalfLife = 31-0
PriorityFavorSmall = false
PriorityFlags = ACCRUE_ALWAYS,FAIR_TREE
PriorityMaxAge = 7-0
PriorityType = priority/multifactor
PriorityWeightAge = 10000
PriorityWeightFairshare = 10000
PriorityWeightJobSize = 40000
PriorityWeightPartition = 10000
PriorityWeightQOS = 0
PrivateData = accounts,jobs,usage,users,reservations
ProctrackType = proctrack/cgroup
Prolog = /etc/slurm/slurm.prolog
PrologFlags = contain
PropagateResourceLimitsExcept = MEMLOCK
RebootProgram = /usr/sbin/reboot
ResumeTimeout = 600
ResvOverRun = UNLIMITED
ReturnToService = 1
SchedulerType = sched/backfill
SelectType = select/cons_res
SelectTypeParameters = CR_CPU
SlurmUser = root
SlurmctldDebug = info
SlurmctldLogFile = /var/log/slurmctld.log
SlurmctldPidFile = /var/run/slurmctld.pid
SlurmctldPort = 6817
SlurmctldTimeout = 120
SlurmdDebug = info
SlurmdLogFile = /var/log/slurmd.log
SlurmdPidFile = /var/run/slurmd.pid
SlurmdPort = 6818
SlurmdSpoolDir = /tmp/slurmd
SlurmdTimeout = 300
SrunPortRange = 60001-63000
StateSaveLocation = /etc/slurm/state
SwitchType = switch/none
TaskPlugin = task/cgroup
Waittime = 0


# Nodes
NodeName=cas[001-100] CPUs=40 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=190000 Weight=60


# Partitions
PartitionName=small Nodes=cas[001-100] MaxCPUsPerNode=39 MaxNodes=1 MaxTime=7-00:00:00 DefMemPerCPU=4700 MaxMemPerCPU=4700 State=UP AllowQos=ALL





