[slurm-users] SLURM launching jobs onto nodes with suspended jobs may lead to resource contention

Paul Edmon pedmon at cfa.harvard.edu
Wed Sep 16 13:31:37 UTC 2020


This is a feature of suspend.  When Slurm suspends a job it does not 
keep the job's CPUs reserved: it pauses the job's processes and keeps 
the memory allocated, but releases the CPUs back to the scheduler.
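
You can watch this happen on a node.  A rough sketch (the job ID and 
node name below are made up):

    # Suspend a running job; its processes are stopped and, per the
    # behavior above, its CPUs go back to the scheduler
    scontrol suspend 12345

    # The job now shows up in the suspended (S) state
    squeue -j 12345 --states=SUSPENDED

    # On the node, CPUAlloc should drop while AllocMem stays put
    scontrol show node cas001 | grep -oE '(CPUAlloc|AllocMem)=[0-9]+'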

If you want to pause jobs without creating contention you need to use 
scancel with this option:
-s, --signal=<signal_name>
    The name or number of the signal to send. If this option is not
    used, the specified job or step will be terminated. Note: if this
    option is used, the signal is sent directly to the slurmd where the
    job is running, bypassing slurmctld, so the job state will not
    change even if the signal is delivered. Use the scontrol command if
    you want the job state change to be known to slurmctld.

and send SIGSTOP (to pause) or SIGCONT (to resume).
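
Roughly like this (job ID made up; depending on how the job was started 
you may also want -b/--batch or -f/--full so the batch script itself 
gets the signal, not just the steps):

    # Stop the job's processes without slurmctld noticing
    scancel --signal=SIGSTOP 12345

    # ...and let them run again later
    scancel --signal=SIGCONT 12345

Because slurmctld never hears about it, the job stays in the R state and 
its CPUs remain allocated, which is exactly what prevents new jobs from 
being packed onto those cores.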

Frankly, I wish suspend didn't work like this.  It should suspend the 
job and keep its CPUs reserved rather than releasing them.  That's the 
natural understanding of suspend, but it's not the way suspend actually 
works in Slurm.

-Paul Edmon-

On 9/16/2020 6:08 AM, SJTU wrote:
> Hi,
>
> I am using SLURM 19.05 and found that SLURM may launch jobs onto nodes with suspended jobs, which leads to resource contention once the suspended jobs are resumed. Steps to reproduce this issue are:
>
> 1. Launch 40 one-core jobs on a 40-core compute node.
> 2. Suspend all 40 jobs on that compute node with `scontrol suspend JOBID`.
>
> Expected results: No more jobs should be launched onto the compute node, since there are already 40 suspended jobs on it.
>
> Actual results: SLURM launches new jobs onto that compute node, which may lead to resource contention if the previously suspended jobs are resumed via `scontrol resume` at that point.
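>
> In command form, roughly (exact submission flags will vary with the 
> site configuration):
>
>     # 1. Fill a 40-core node with one-core jobs
>     for i in $(seq 40); do sbatch -n 1 --wrap 'sleep 3600'; done
>
>     # 2. Suspend every job running on that node
>     squeue -w cas001 -h -o %A | xargs -n 1 scontrol suspend
>
>     # 3. Submit more jobs; they start on the same node despite the
>     #    40 suspended jobs already there
>     sbatch -n 1 --wrap 'sleep 3600'
>     squeue -w cas001 -o '%A %t %C'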
>   
> Any suggestion is appreciated. Part of slurm.conf is attached.
>
> Thank you!
>
>
> Jianwen
>
>
>
>
> AccountingStorageEnforce = associations,limits,qos,safe
> AccountingStorageType = accounting_storage/slurmdbd
> AuthType = auth/munge
> BackupController = slurm2
> CacheGroups = 0
> ClusterName = mycluster
> ControlMachine = slurm1
> EnforcePartLimits = true
> Epilog = /etc/slurm/slurm.epilog
> FastSchedule = 1
> GresTypes = gpu
> HealthCheckInterval = 300
> HealthCheckProgram = /usr/sbin/nhc
> InactiveLimit = 0
> JobAcctGatherFrequency = 30
> JobAcctGatherType = jobacct_gather/cgroup
> JobCompType = jobcomp/none
> JobRequeue = 0
> JobSubmitPlugins = lua
> KillOnBadExit = 1
> KillWait = 30
> MailProg = /opt/slurm-mail/bin/slurm-spool-mail.py
> MaxArraySize = 8196
> MaxJobCount = 100000
> MessageTimeout = 30
> MinJobAge = 300
> MpiDefault = none
> PriorityDecayHalfLife = 31-0
> PriorityFavorSmall = false
> PriorityFlags = ACCRUE_ALWAYS,FAIR_TREE
> PriorityMaxAge = 7-0
> PriorityType = priority/multifactor
> PriorityWeightAge = 10000
> PriorityWeightFairshare = 10000
> PriorityWeightJobSize = 40000
> PriorityWeightPartition = 10000
> PriorityWeightQOS = 0
> PrivateData = accounts,jobs,usage,users,reservations
> ProctrackType = proctrack/cgroup
> Prolog = /etc/slurm/slurm.prolog
> PrologFlags = contain
> PropagateResourceLimitsExcept = MEMLOCK
> RebootProgram = /usr/sbin/reboot
> ResumeTimeout = 600
> ResvOverRun = UNLIMITED
> ReturnToService = 1
> SchedulerType = sched/backfill
> SelectType = select/cons_res
> SelectTypeParameters = CR_CPU
> SlurmUser = root
> SlurmctldDebug = info
> SlurmctldLogFile = /var/log/slurmctld.log
> SlurmctldPidFile = /var/run/slurmctld.pid
> SlurmctldPort = 6817
> SlurmctldTimeout = 120
> SlurmdDebug = info
> SlurmdLogFile = /var/log/slurmd.log
> SlurmdPidFile = /var/run/slurmd.pid
> SlurmdPort = 6818
> SlurmdSpoolDir = /tmp/slurmd
> SlurmdTimeout = 300
> SrunPortRange = 60001-63000
> StateSaveLocation = /etc/slurm/state
> SwitchType = switch/none
> TaskPlugin = task/cgroup
> Waittime = 0
>
>
> # Nodes
> NodeName=cas[001-100] CPUs=40 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=190000 Weight=60
>
>
> # Partitions
> PartitionName=small Nodes=cas[001-100] MaxCPUsPerNode=39 MaxNodes=1 MaxTime=7-00:00:00 DefMemPerCPU=4700 MaxMemPerCPU=4700 State=UP AllowQos=ALL
>
>
>
>