[slurm-users] SLURM launching jobs onto nodes with suspended jobs may lead to resource contention
Paul Edmon
pedmon at cfa.harvard.edu
Wed Sep 16 13:31:37 UTC 2020
This is a feature of suspend. When Slurm suspends a job, it does not keep
the job's CPUs reserved: it pauses the job's processes and keeps its memory
allocated, but the CPUs are released for the scheduler to hand out again.
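For example (a minimal sketch; the job ID 12345 and node name cas001 are
placeholders, and the exact output fields vary by Slurm version):

    scontrol suspend 12345       # job state becomes SUSPENDED; its memory stays allocated
    squeue -j 12345 -o "%i %T"   # reports the job as SUSPENDED
    scontrol show node cas001    # per the behavior described above, the job's CPUs are no longer reserved
    scontrol resume 12345        # job runs again and may now contend for CPUs with newer jobs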
If you want to pause jobs without creating contention, you need to use
scancel with:
-s, --signal=signal_name
    The name or number of the signal to send. If this option is not
    used, the specified job or step will be terminated. Note: if this
    option is used, the signal is sent directly to the slurmd where the
    job is running, bypassing slurmctld, so the job state will not
    change even if the signal is delivered. Use the scontrol command if
    you want the job state change to be known to slurmctld.
and send SIGSTOP to pause the job or SIGCONT to resume it.
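For example, to pause and later resume a job without Slurm handing its CPUs
to other jobs (a minimal sketch; 12345 is a placeholder job ID, and depending
on whether the work runs in srun steps or directly in the batch script, the
-b/--batch or -f/--full flag may also be needed so the signal reaches the
batch shell):

    scancel --signal=STOP 12345   # processes stop, but slurmctld still sees the job as RUNNING, so its CPUs stay allocated
    scancel --signal=CONT 12345   # processes continue where they left off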
Frankly, I wish suspend didn't work like this. It should suspend the job
while keeping its CPUs reserved rather than releasing them. That's the
natural understanding of suspend, but it's not the way suspend actually
works in Slurm.
-Paul Edmon-
On 9/16/2020 6:08 AM, SJTU wrote:
> Hi,
>
> I am using SLURM 19.05 and found that SLURM may launch jobs onto nodes with suspended jobs, which leads to resource contention once the suspended jobs are resumed. Steps to reproduce this issue:
>
> 1. Launch 40 one-core jobs on a 40-core compute node.
> 2. Suspend all 40 jobs on that compute node with `scontrol suspend JOBID`.
>
> Expected results: No more jobs should be launched onto the compute node, since there are already 40 suspended jobs on it.
>
> Actual results: SLURM launches new jobs on that compute node, which may lead to resource contention if the previously suspended jobs are later resumed via `scontrol resume`.
>
> Any suggestion is appreciated. Part of slurm.conf is attached.
>
> Thank you!
>
>
> Jianwen
>
>
>
>
> AccountingStorageEnforce = associations,limits,qos,safe
> AccountingStorageType = accounting_storage/slurmdbd
> AuthType = auth/munge
> BackupController = slurm2
> CacheGroups = 0
> ClusterName = mycluster
> ControlMachine = slurm1
> EnforcePartLimits = true
> Epilog = /etc/slurm/slurm.epilog
> FastSchedule = 1
> GresTypes = gpu
> HealthCheckInterval = 300
> HealthCheckProgram = /usr/sbin/nhc
> InactiveLimit = 0
> JobAcctGatherFrequency = 30
> JobAcctGatherType = jobacct_gather/cgroup
> JobCompType = jobcomp/none
> JobRequeue = 0
> JobSubmitPlugins = lua
> KillOnBadExit = 1
> KillWait = 30
> MailProg = /opt/slurm-mail/bin/slurm-spool-mail.py
> MaxArraySize = 8196
> MaxJobCount = 100000
> MessageTimeout = 30
> MinJobAge = 300
> MpiDefault = none
> PriorityDecayHalfLife = 31-0
> PriorityFavorSmall = false
> PriorityFlags = ACCRUE_ALWAYS,FAIR_TREE
> PriorityMaxAge = 7-0
> PriorityType = priority/multifactor
> PriorityWeightAge = 10000
> PriorityWeightFairshare = 10000
> PriorityWeightJobSize = 40000
> PriorityWeightPartition = 10000
> PriorityWeightQOS = 0
> PrivateData = accounts,jobs,usage,users,reservations
> ProctrackType = proctrack/cgroup
> Prolog = /etc/slurm/slurm.prolog
> PrologFlags = contain
> PropagateResourceLimitsExcept = MEMLOCK
> RebootProgram = /usr/sbin/reboot
> ResumeTimeout = 600
> ResvOverRun = UNLIMITED
> ReturnToService = 1
> SchedulerType = sched/backfill
> SelectType = select/cons_res
> SelectTypeParameters = CR_CPU
> SlurmUser = root
> SlurmctldDebug = info
> SlurmctldLogFile = /var/log/slurmctld.log
> SlurmctldPidFile = /var/run/slurmctld.pid
> SlurmctldPort = 6817
> SlurmctldTimeout = 120
> SlurmdDebug = info
> SlurmdLogFile = /var/log/slurmd.log
> SlurmdPidFile = /var/run/slurmd.pid
> SlurmdPort = 6818
> SlurmdSpoolDir = /tmp/slurmd
> SlurmdTimeout = 300
> SrunPortRange = 60001-63000
> StateSaveLocation = /etc/slurm/state
> SwitchType = switch/none
> TaskPlugin = task/cgroup
> Waittime = 0
>
>
> # Nodes
> NodeName=cas[001-100] CPUs=40 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=190000 Weight=60
>
>
> # Partitions
> PartitionName=small Nodes=cas[001-100] MaxCPUsPerNode=39 MaxNodes=1 MaxTime=7-00:00:00 DefMemPerCPU=4700 MaxMemPerCPU=4700 State=UP AllowQos=ALL
>