[slurm-users] [Support] SLURM launching jobs onto nodes with suspended jobs may lead to resource contention

SJTU weijianwen at sjtu.edu.cn
Thu Sep 17 03:16:42 UTC 2020


Thank you, Paul. I'll try this workaround.


Best,

Jianwen

> On Sep 16, 2020, at 9:31 PM, Paul Edmon <pedmon at cfa.harvard.edu> wrote:
> 
> This is a feature of suspend.  When Slurm suspends a job it does not keep the CPUs used by that job reserved: it pauses the job and keeps its memory reserved, but releases the CPUs back to the scheduler.
> 
> If you want to pause jobs without creating contention, you need to use scancel with the --signal option:
> 
> -s, --signal=signal_name
> The name or number of the signal to send. If this option is not used the specified job or step will be terminated. Note: if this option is used the signal is sent directly to the slurmd where the job is running, bypassing slurmctld, so the job state will not change even if the signal is delivered to it. Use the scontrol command if you want the job state change to be known to slurmctld.
> and issue SIGSTOP or SIGCONT as needed.
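> 
> For example, a minimal sketch of the workaround (JOBID stands for the job you want to pause; untested, adjust as needed):
> 
>     # pause the job's processes; slurmctld still sees the job as running,
>     # so its CPUs stay allocated and no new jobs land on them
>     scancel --signal=SIGSTOP JOBID
> 
>     # later, let the job continue
>     scancel --signal=SIGCONT JOBID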
> 
> Frankly, I wish suspend didn't work like this.  It should suspend the job without releasing the CPUs, keeping them reserved.  That's the natural understanding of suspend, but it's not the way suspend actually works in Slurm.
> 
> -Paul Edmon-
> 
> On 9/16/2020 6:08 AM, SJTU wrote:
>> Hi,
>> 
>> I am using SLURM 19.05 and found that SLURM may launch new jobs onto nodes that hold suspended jobs, which leads to resource contention once the suspended jobs are resumed. Steps to reproduce the issue:
>> 
>> 1. Launch 40 one-core jobs on a 40-core compute node. 
>> 2. Suspend all 40 jobs on that compute node with `scontrol suspend JOBID`.
>> 
>> Expected results: no more jobs should be launched onto the compute node, since it already holds 40 suspended jobs.
>> 
>> Actual results: SLURM launches new jobs onto that compute node, which can lead to resource contention when the previously suspended jobs are later resumed via `scontrol resume`.
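>> 
>> A rough command-line sketch of these steps (the node name cas001 and the sleep payload are only examples):
>> 
>>     # 1. fill the 40-core node with 40 one-core jobs
>>     for i in $(seq 40); do sbatch -n 1 -w cas001 --wrap="sleep 3600"; done
>> 
>>     # 2. suspend every job on that node
>>     squeue -h -w cas001 -o "%i" | xargs -n 1 scontrol suspend
>> 
>>     # a further job is still scheduled onto cas001 and will contend for cores
>>     # with the suspended jobs once they are resumed
>>     sbatch -n 1 -w cas001 --wrap="sleep 3600"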
>>  
>> Any suggestions are appreciated. The relevant part of our slurm.conf is included below.
>> 
>> Thank you!
>> 
>> 
>> Jianwen
>> 
>> 
>> 
>> 
>> AccountingStorageEnforce = associations,limits,qos,safe
>> AccountingStorageType = accounting_storage/slurmdbd
>> AuthType = auth/munge
>> BackupController = slurm2
>> CacheGroups = 0
>> ClusterName = mycluster
>> ControlMachine = slurm1
>> EnforcePartLimits = true
>> Epilog = /etc/slurm/slurm.epilog
>> FastSchedule = 1
>> GresTypes = gpu
>> HealthCheckInterval = 300
>> HealthCheckProgram = /usr/sbin/nhc
>> InactiveLimit = 0
>> JobAcctGatherFrequency = 30
>> JobAcctGatherType = jobacct_gather/cgroup
>> JobCompType = jobcomp/none
>> JobRequeue = 0
>> JobSubmitPlugins = lua
>> KillOnBadExit = 1
>> KillWait = 30
>> MailProg = /opt/slurm-mail/bin/slurm-spool-mail.py
>> MaxArraySize = 8196
>> MaxJobCount = 100000
>> MessageTimeout = 30
>> MinJobAge = 300
>> MpiDefault = none
>> PriorityDecayHalfLife = 31-0
>> PriorityFavorSmall = false
>> PriorityFlags = ACCRUE_ALWAYS,FAIR_TREE
>> PriorityMaxAge = 7-0
>> PriorityType = priority/multifactor
>> PriorityWeightAge = 10000
>> PriorityWeightFairshare = 10000
>> PriorityWeightJobSize = 40000
>> PriorityWeightPartition = 10000
>> PriorityWeightQOS = 0
>> PrivateData = accounts,jobs,usage,users,reservations
>> ProctrackType = proctrack/cgroup
>> Prolog = /etc/slurm/slurm.prolog
>> PrologFlags = contain
>> PropagateResourceLimitsExcept = MEMLOCK
>> RebootProgram = /usr/sbin/reboot
>> ResumeTimeout = 600
>> ResvOverRun = UNLIMITED
>> ReturnToService = 1
>> SchedulerType = sched/backfill
>> SelectType = select/cons_res
>> SelectTypeParameters = CR_CPU
>> SlurmUser = root
>> SlurmctldDebug = info
>> SlurmctldLogFile = /var/log/slurmctld.log
>> SlurmctldPidFile = /var/run/slurmctld.pid
>> SlurmctldPort = 6817
>> SlurmctldTimeout = 120
>> SlurmdDebug = info
>> SlurmdLogFile = /var/log/slurmd.log
>> SlurmdPidFile = /var/run/slurmd.pid
>> SlurmdPort = 6818
>> SlurmdSpoolDir = /tmp/slurmd
>> SlurmdTimeout = 300
>> SrunPortRange = 60001-63000
>> StateSaveLocation = /etc/slurm/state
>> SwitchType = switch/none
>> TaskPlugin = task/cgroup
>> Waittime = 0
>> 
>> 
>> # Nodes
>> NodeName=cas[001-100] CPUs=40 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=190000 Weight=60
>> 
>> 
>> # Partitions
>> PartitionName=small Nodes=cas[001-100] MaxCPUsPerNode=39 MaxNodes=1 MaxTime=7-00:00:00 DefMemPerCPU=4700 MaxMemPerCPU=4700 State=UP AllowQos=ALL
>> 
>> 
>> 
>> 
> _______________________________________________
> Support mailing list
> Support at lists.hpc.sjtu.edu.cn
> http://lists.hpc.sjtu.edu.cn/mailman/listinfo/support
