Hi all, we are trying to implement preemption on one of our partitions so that we can run priority jobs on it, suspending the jobs already running on the partition and resuming them once the priority job is done. We have read through the Slurm documentation and done the configuration, but somehow we cannot make it work. Other preemption modes, such as CANCEL, work fine, but SUSPEND does not: the higher-priority jobs that should suspend the others stay pending, waiting for resources. We are using a QOS to assign priority to the job, and that QOS is also configured so it can preempt certain other QOSs. This is the output of scontrol show config:
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits,qos
AccountingStorageHost = -Some-server-
AccountingStorageExternalHost = (null)
AccountingStorageParameters = (null)
AccountingStoragePort = 6819
AccountingStorageTRES = cpu,mem,energy,node,billing,fs/disk,vmem,pages
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = N/A
AccountingStoreFlags = job_comment,job_env,job_extra,job_script
AcctGatherEnergyType = (null)
AcctGatherFilesystemType = (null)
AcctGatherInterconnectType = (null)
AcctGatherNodeFreq = 0 sec
AcctGatherProfileType = (null)
AllowSpecResourcesUsage = No
AuthAltTypes = (null)
AuthAltParameters = (null)
AuthInfo = (null)
AuthType = auth/munge
BatchStartTimeout = 10 sec
BcastExclude = /lib,/usr/lib,/lib64,/usr/lib64
BcastParameters = (null)
BOOT_TIME = 2024-03-15T13:32:05
BurstBufferType = (null)
CliFilterPlugins = (null)
ClusterName = cluster
CommunicationParameters = (null)
CompleteWait = 0 sec
CoreSpecPlugin = (null)
CpuFreqDef = Unknown
CpuFreqGovernors = OnDemand,Performance,UserSpace
CredType = cred/munge
DebugFlags = (null)
DefMemPerCPU = 3500
DependencyParameters = (null)
DisableRootJobs = Yes
EioTimeout = 60
EnforcePartLimits = ALL
Epilog = (null)
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
ExtSensorsType = (null)
ExtSensorsFreq = 0 sec
FairShareDampeningFactor = 1
FederationParameters = (null)
FirstJobId = 1
GetEnvTimeout = 2 sec
GresTypes = gpu
GpuFreqDef = (null)
GroupUpdateForce = 1
GroupUpdateTime = 600 sec
HASH_VAL = Match
HealthCheckInterval = 300 sec
HealthCheckNodeState = ANY
HealthCheckProgram = /usr/sbin/nhc
InactiveLimit = 0 sec
InteractiveStepOptions = --interactive --preserve-env --pty $SHELL
JobAcctGatherFrequency = 30
JobAcctGatherType = jobacct_gather/linux
JobAcctGatherParams = (null)
JobCompHost = localhost
JobCompLoc = (null)
JobCompParams = (null)
JobCompPort = 0
JobCompType = (null)
JobCompUser = root
JobContainerType = (null)
JobDefaults = (null)
JobFileAppend = 0
JobRequeue = 1
JobSubmitPlugins = lua
KillOnBadExit = 0
KillWait = 30 sec
LaunchParameters = (null)
Licenses = (null)
LogTimeFormat = iso8601_ms
MailDomain = (null)
MailProg = /bin/mail
MaxArraySize = 1001
MaxBatchRequeue = 5
MaxDBDMsgs = 20024
MaxJobCount = 10000
MaxJobId = 67043328
MaxMemPerNode = UNLIMITED
MaxNodeCount = 6
MaxStepCount = 40000
MaxTasksPerNode = 512
MCSPlugin = (null)
MCSParameters = (null)
MessageTimeout = 10 sec
MinJobAge = 300 sec
MpiDefault = pmix_v2
MpiParams = (null)
NEXT_JOB_ID = 3626
NodeFeaturesPlugins = (null)
OverTimeLimit = 0 min
PluginDir = /usr/lib64/slurm
PlugStackConfig = (null)
PowerParameters = (null)
PowerPlugin = (null)
PreemptMode = GANG,SUSPEND
PreemptParameters = (null)
PreemptType = preempt/qos
PreemptExemptTime = 00:00:00
PrEpParameters = (null)
PrEpPlugins = prep/script
PriorityParameters = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityDecayHalfLife = 7-00:00:00
PriorityCalcPeriod = 00:05:00
PriorityFavorSmall = No
PriorityFlags =
PriorityMaxAge = 7-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType = priority/multifactor
PriorityWeightAge = 0
PriorityWeightAssoc = 0
PriorityWeightFairShare = 0
PriorityWeightJobSize = 0
PriorityWeightPartition = 0
PriorityWeightQOS = 5000
PriorityWeightTRES = (null)
PrivateData = none
ProctrackType = proctrack/cgroup
Prolog = (null)
PrologEpilogTimeout = 65534
PrologSlurmctld = (null)
PrologFlags = (null)
PropagatePrioProcess = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram = /etc/slurm/slurmupdate.sh
ReconfigFlags = (null)
RequeueExit = (null)
RequeueExitHold = (null)
ResumeFailProgram = (null)
ResumeProgram = (null)
ResumeRate = 300 nodes/min
ResumeTimeout = 60 sec
ResvEpilog = (null)
ResvOverRun = 0 min
ResvProlog = (null)
ReturnToService = 1
SchedulerParameters = bf_max_job_user=2
SchedulerTimeSlice = 30 sec
SchedulerType = sched/backfill
ScronParameters = (null)
SelectType = select/cons_tres
SelectTypeParameters = CR_CORE_MEMORY
SlurmUser = slurm(888)
SlurmctldAddr = (null)
SlurmctldDebug = info
SlurmctldHost[0] = -some-server-
SlurmctldLogFile = /var/log/slurm/slurmctld.log
SlurmctldPort = 6817
SlurmctldSyslogDebug = (null)
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg = (null)
SlurmctldTimeout = 120 sec
SlurmctldParameters = (null)
SlurmdDebug = info
SlurmdLogFile = /var/log/slurm/slurmd.log
SlurmdParameters = (null)
SlurmdPidFile = /var/run/slurmd.pid
SlurmdPort = 6818
SlurmdSpoolDir = /var/spool/slurm
SlurmdSyslogDebug = (null)
SlurmdTimeout = 300 sec
SlurmdUser = root(0)
SlurmSchedLogFile = (null)
SlurmSchedLogLevel = 0
SlurmctldPidFile = /var/run/slurmctld.pid
SLURM_CONF = /etc/slurm/slurm.conf
SLURM_VERSION = 23.11.1
SrunEpilog = (null)
SrunPortRange = 0-0
SrunProlog = (null)
StateSaveLocation = /var/spool/slurmctld
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendExcStates = (null)
SuspendProgram = (null)
SuspendRate = 60 nodes/min
SuspendTime = INFINITE
SuspendTimeout = 30 sec
SwitchParameters = (null)
SwitchType = (null)
TaskEpilog = (null)
TaskPlugin = task/cgroup,task/affinity
TaskPluginParam = none
TaskProlog = (null)
TCPTimeout = 2 sec
TmpFS = /tmp
TopologyParam = (null)
TopologyPlugin = topology/default
TrackWCKey = No
TreeWidth = 16
UsePam = No
UnkillableStepProgram = (null)
UnkillableStepTimeout = 60 sec
VSizeFactor = 0 percent
WaitTime = 0 sec
X11Parameters = (null)
Cgroup Support Configuration:
AllowedRAMSpace = 100.0%
AllowedSwapSpace = 0.0%
CgroupMountpoint = /sys/fs/cgroup
CgroupPlugin = autodetect
ConstrainCores = yes
ConstrainDevices = yes
ConstrainRAMSpace = yes
ConstrainSwapSpace = no
EnableControllers = no
IgnoreSystemd = no
IgnoreSystemdOnFailure = no
MaxRAMPercent = 100.0%
MaxSwapPercent = 100.0%
MemorySwappiness = (null)
MinRAMSpace = 30 MB
MPI Plugins Configuration:
PMIxCliTmpDirBase = (null)
PMIxCollFence = (null)
PMIxDebug = 0
PMIxDirectConn = yes
PMIxDirectConnEarly = no
PMIxDirectConnUCX = no
PMIxDirectSameArch = no
PMIxEnv = (null)
PMIxFenceBarrier = no
PMIxNetDevicesUCX = (null)
PMIxTimeout = 300
PMIxTlsUCX = (null)
The partition is configured like this:
PartitionName=test-partition
   AllowGroups=sysadmin,users AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
   Nodes=vserv-[275-277]
   PriorityJobFactor=1 PriorityTier=100 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1
   OverTimeLimit=NONE PreemptMode=GANG,SUSPEND
   State=UP TotalCPUs=32 TotalNodes=4 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
   TRES=cpu=32,mem=304933M,node=4,billing=32
Our QOS looks like this:
| Name      | Priority | GraceTime | Preempt   | PreemptMode | Flags       | UsageFactor | MaxTRESPU       | MaxJobsPU | MaxSubmitPU |
|-----------|----------|-----------|-----------|-------------|-------------|-------------|-----------------|-----------|-------------|
| normal    | 50       | 00:00:00  |           | cluster     |             | 1.000000    |                 | 20        | 50          |
| preempter | 1000     | 00:00:00  | preempted | gang,suspe+ |             | 1.000000    |                 |           |             |
| preempted | 0        | 00:00:00  |           | gang,suspe+ | OverPartQOS | 1.000000    | cpu=100,node=10 |           |             |
I can provide more of the configuration if needed. Do you see anything strange, or any property that still needs to be set? This is the state of the queue:
| JOBID | QOS       | ST | TIME | NODELIST(REASON) | PARTITION      | PRIORITY |
|-------|-----------|----|------|------------------|----------------|----------|
| 3629  | preempted | PD | 0:00 | (Resources)      | test-partition | 1        |
| 3627  | preempted | R  | 0:20 | vserv-276        | test-partition | 1        |
| 3628  | preempted | R  | 0:20 | vserv-277        | test-partition | 1        |
| 3626  | preempted | R  | 0:27 | vserv-275        | test-partition | 1        |
| 3630  | preempter | PD | 0:00 | (Resources)      | test-partition | 5000     |
Any advice is welcome.
Regards,
Nischey Verma
When a job is suspended, its CPUs are released but its memory is not. Look at this output and compare the allocated memory on a node before and after suspending a job on it:
sinfo -N -n yourNode --Format=weight:8,nodelist:15,cpusstate:12,memory:8,allocmem:8
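For example, using one of the running jobs from your squeue output (3627 on vserv-276) — this is only a sketch of the check, not something I have run against your cluster:

sinfo -N -n vserv-276 --Format=weight:8,nodelist:15,cpusstate:12,memory:8,allocmem:8
scontrol suspend 3627
sinfo -N -n vserv-276 --Format=weight:8,nodelist:15,cpusstate:12,memory:8,allocmem:8
scontrol resume 3627

If suspend behaves as described above, the allocated CPU count in CPUS(A/I/O/T) should drop after the suspend while ALLOCMEM stays unchanged, which is why the preempting job still sees the node as having no free memory.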
From: Verma, Nischey (HPC ENG,RAL,LSCI) via slurm-users <slurm-users@lists.schedmd.com>
Sent: Friday, March 15, 2024 11:06 AM
To: slurm-users@lists.schedmd.com
Cc: Taneja, Sonia (DLSLtd,RAL,LSCI) <sonia.taneja@diamond.ac.uk>
Subject: [slurm-users] Slurm suspend preemption not working
...
I think you need to set a reasonable "DefMemPerCPU"; otherwise jobs take all of a node's memory by default, and there is no memory left for a second job.
We calculated DefMemPerCPU so that the default allocated memory for a full node is slightly under half of the total node memory, so there is always room for a second job. If a user needs more, they use the "--mem" option in sbatch and get what they need.
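As a purely illustrative calculation (the node size here is made up, not taken from your cluster): on a 16-core node with 128 GB (131072 MB) of RAM, a line in slurm.conf such as

DefMemPerCPU=4000

gives a default full-node allocation of 16 x 4000 MB = 64000 MB, slightly under half of the node's memory, so a suspended job and a preempting job can both fit in memory at the same time. A user who really needs more can still ask for it explicitly, e.g. (job.sh is just a placeholder for the batch script):

sbatch --mem=100G job.sh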
cheers
josef
On 15. 03. 24 16:05, Verma, Nischey (HPC ENG,RAL,LSCI) via slurm-users wrote:
...