Hi All,
We are trying to implement preemption on one of our partitions so that we can run priority jobs on it, suspending the jobs already running there and resuming them once the priority job is done. We have read through the Slurm documentation and done the configuration, but somehow we cannot make it work. Other preemption modes such as CANCEL work fine, but SUSPEND does not: the higher-priority jobs that should suspend the running ones just stay pending, waiting for resources. We use a QOS to assign the priority to the job, and that QOS is also configured so it can preempt certain other QOSs. This is the output of our scontrol show config:
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits,qos
AccountingStorageHost = -Some-server-
AccountingStorageExternalHost = (null)
AccountingStorageParameters = (null)
AccountingStoragePort = 6819
AccountingStorageTRES = cpu,mem,energy,node,billing,fs/disk,vmem,pages
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = N/A
AccountingStoreFlags = job_comment,job_env,job_extra,job_script
AcctGatherEnergyType = (null)
AcctGatherFilesystemType = (null)
AcctGatherInterconnectType = (null)
AcctGatherNodeFreq = 0 sec
AcctGatherProfileType = (null)
AllowSpecResourcesUsage = No
AuthAltTypes = (null)
AuthAltParameters = (null)
AuthInfo = (null)
AuthType = auth/munge
BatchStartTimeout = 10 sec
BcastExclude = /lib,/usr/lib,/lib64,/usr/lib64
BcastParameters = (null)
BOOT_TIME = 2024-03-15T13:32:05
BurstBufferType = (null)
CliFilterPlugins = (null)
ClusterName = cluster
CommunicationParameters = (null)
CompleteWait = 0 sec
CoreSpecPlugin = (null)
CpuFreqDef = Unknown
CpuFreqGovernors = OnDemand,Performance,UserSpace
CredType = cred/munge
DebugFlags = (null)
DefMemPerCPU = 3500
DependencyParameters = (null)
DisableRootJobs = Yes
EioTimeout = 60
EnforcePartLimits = ALL
Epilog = (null)
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
ExtSensorsType = (null)
ExtSensorsFreq = 0 sec
FairShareDampeningFactor = 1
FederationParameters = (null)
FirstJobId = 1
GetEnvTimeout = 2 sec
GresTypes = gpu
GpuFreqDef = (null)
GroupUpdateForce = 1
GroupUpdateTime = 600 sec
HASH_VAL = Match
HealthCheckInterval = 300 sec
HealthCheckNodeState = ANY
HealthCheckProgram = /usr/sbin/nhc
InactiveLimit = 0 sec
InteractiveStepOptions = --interactive --preserve-env --pty $SHELL
JobAcctGatherFrequency = 30
JobAcctGatherType = jobacct_gather/linux
JobAcctGatherParams = (null)
JobCompHost = localhost
JobCompLoc = (null)
JobCompParams = (null)
JobCompPort = 0
JobCompType = (null)
JobCompUser = root
JobContainerType = (null)
JobDefaults = (null)
JobFileAppend = 0
JobRequeue = 1
JobSubmitPlugins = lua
KillOnBadExit = 0
KillWait = 30 sec
LaunchParameters = (null)
Licenses = (null)
LogTimeFormat = iso8601_ms
MailDomain = (null)
MailProg = /bin/mail
MaxArraySize = 1001
MaxBatchRequeue = 5
MaxDBDMsgs = 20024
MaxJobCount = 10000
MaxJobId = 67043328
MaxMemPerNode = UNLIMITED
MaxNodeCount = 6
MaxStepCount = 40000
MaxTasksPerNode = 512
MCSPlugin = (null)
MCSParameters = (null)
MessageTimeout = 10 sec
MinJobAge = 300 sec
MpiDefault = pmix_v2
MpiParams = (null)
NEXT_JOB_ID = 3626
NodeFeaturesPlugins = (null)
OverTimeLimit = 0 min
PluginDir = /usr/lib64/slurm
PlugStackConfig = (null)
PowerParameters = (null)
PowerPlugin = (null)
PreemptMode = GANG,SUSPEND
PreemptParameters = (null)
PreemptType = preempt/qos
PreemptExemptTime = 00:00:00
PrEpParameters = (null)
PrEpPlugins = prep/script
PriorityParameters = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityDecayHalfLife = 7-00:00:00
PriorityCalcPeriod = 00:05:00
PriorityFavorSmall = No
PriorityFlags =
PriorityMaxAge = 7-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType = priority/multifactor
PriorityWeightAge = 0
PriorityWeightAssoc = 0
PriorityWeightFairShare = 0
PriorityWeightJobSize = 0
PriorityWeightPartition = 0
PriorityWeightQOS = 5000
PriorityWeightTRES = (null)
PrivateData = none
ProctrackType = proctrack/cgroup
Prolog = (null)
PrologEpilogTimeout = 65534
PrologSlurmctld = (null)
PrologFlags = (null)
PropagatePrioProcess = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram = /etc/slurm/slurmupdate.sh
ReconfigFlags = (null)
RequeueExit = (null)
RequeueExitHold = (null)
ResumeFailProgram = (null)
ResumeProgram = (null)
ResumeRate = 300 nodes/min
ResumeTimeout = 60 sec
ResvEpilog = (null)
ResvOverRun = 0 min
ResvProlog = (null)
ReturnToService = 1
SchedulerParameters = bf_max_job_user=2
SchedulerTimeSlice = 30 sec
SchedulerType = sched/backfill
ScronParameters = (null)
SelectType = select/cons_tres
SelectTypeParameters = CR_CORE_MEMORY
SlurmUser = slurm(888)
SlurmctldAddr = (null)
SlurmctldDebug = info
SlurmctldHost[0] = -some-server-
SlurmctldLogFile = /var/log/slurm/slurmctld.log
SlurmctldPort = 6817
SlurmctldSyslogDebug = (null)
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg = (null)
SlurmctldTimeout = 120 sec
SlurmctldParameters = (null)
SlurmdDebug = info
SlurmdLogFile = /var/log/slurm/slurmd.log
SlurmdParameters = (null)
SlurmdPidFile = /var/run/slurmd.pid
SlurmdPort = 6818
SlurmdSpoolDir = /var/spool/slurm
SlurmdSyslogDebug = (null)
SlurmdTimeout = 300 sec
SlurmdUser = root(0)
SlurmSchedLogFile = (null)
SlurmSchedLogLevel = 0
SlurmctldPidFile = /var/run/slurmctld.pid
SLURM_CONF = /etc/slurm/slurm.conf
SLURM_VERSION = 23.11.1
SrunEpilog = (null)
SrunPortRange = 0-0
SrunProlog = (null)
StateSaveLocation = /var/spool/slurmctld
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendExcStates = (null)
SuspendProgram = (null)
SuspendRate = 60 nodes/min
SuspendTime = INFINITE
SuspendTimeout = 30 sec
SwitchParameters = (null)
SwitchType = (null)
TaskEpilog = (null)
TaskPlugin = task/cgroup,task/affinity
TaskPluginParam = none
TaskProlog = (null)
TCPTimeout = 2 sec
TmpFS = /tmp
TopologyParam = (null)
TopologyPlugin = topology/default
TrackWCKey = No
TreeWidth = 16
UsePam = No
UnkillableStepProgram = (null)
UnkillableStepTimeout = 60 sec
VSizeFactor = 0 percent
WaitTime = 0 sec
X11Parameters = (null)
Cgroup Support Configuration:
AllowedRAMSpace = 100.0%
AllowedSwapSpace = 0.0%
CgroupMountpoint = /sys/fs/cgroup
CgroupPlugin = autodetect
ConstrainCores = yes
ConstrainDevices = yes
ConstrainRAMSpace = yes
ConstrainSwapSpace = no
EnableControllers = no
IgnoreSystemd = no
IgnoreSystemdOnFailure = no
MaxRAMPercent = 100.0%
MaxSwapPercent = 100.0%
MemorySwappiness = (null)
MinRAMSpace = 30 MB
MPI Plugins Configuration:
PMIxCliTmpDirBase = (null)
PMIxCollFence = (null)
PMIxDebug = 0
PMIxDirectConn = yes
PMIxDirectConnEarly = no
PMIxDirectConnUCX = no
PMIxDirectSameArch = no
PMIxEnv = (null)
PMIxFenceBarrier = no
PMIxNetDevicesUCX = (null)
PMIxTimeout = 300
PMIxTlsUCX = (null)
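For reference, the preemption-related values from the dump above correspond to slurm.conf lines roughly like the following; this is a reconstruction from the scontrol output, not a verbatim copy of our file:

PreemptType=preempt/qos
PreemptMode=SUSPEND,GANG
PreemptExemptTime=00:00:00
SchedulerTimeSlice=30
PriorityType=priority/multifactor
PriorityWeightQOS=5000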
The partition is configured like this:
PartitionName=test-partition
AllowGroups=sysadmin,users AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
Nodes=vserv-[275-277]
PriorityJobFactor=1 PriorityTier=100 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1
OverTimeLimit=NONE PreemptMode=GANG,SUSPEND
State=UP TotalCPUs=32 TotalNodes=4 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
TRES=cpu=32,mem=304933M,node=4,billing=32
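In slurm.conf the partition definition behind this output should look roughly like the single line below (again reconstructed from the scontrol output, so treat it as a sketch rather than the exact file contents):

PartitionName=test-partition Nodes=vserv-[275-277] Default=NO State=UP OverSubscribe=FORCE:1 PriorityTier=100 PreemptMode=SUSPEND,GANG AllowGroups=sysadmin,users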
Our QOS looks like this:
| Name      | Priority | GraceTime | Preempt   | PreemptMode | Flags       | UsageFactor | MaxTRESPU       | MaxJobsPU | MaxSubmitPU |
|-----------|----------|-----------|-----------|-------------|-------------|-------------|-----------------|-----------|-------------|
| normal    | 50       | 00:00:00  |           | cluster     |             | 1.000000    |                 | 20        | 50          |
| preempter | 1000     | 00:00:00  | preempted | gang,suspe+ |             | 1.000000    |                 |           |             |
| preempted | 0        | 00:00:00  |           | gang,suspe+ | OverPartQOS | 1.000000    | cpu=100,node=10 |           |             |
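If it helps, here is a hedged sketch of the sacctmgr commands behind these QOS settings; the flags are reconstructed from the table above, and since the PreemptMode column is truncated in the output, the value shown here is an assumption:

sacctmgr add qos preempter
sacctmgr modify qos preempter set Priority=1000 Preempt=preempted PreemptMode=suspend
sacctmgr add qos preempted
sacctmgr modify qos preempted set Priority=0 PreemptMode=suspend Flags=OverPartQOS MaxTRESPerUser=cpu=100,node=10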
I can provide more configs if needed. Do you see anything strange, or is there any property that still needs to be set?
This is the state of the queue:
| JOBID | QOS       | ST | TIME | NODELIST(REASON) | PARTITION      | PRIORITY |
|-------|-----------|----|------|------------------|----------------|----------|
| 3629  | preempted | PD | 0:00 | (Resources)      | test-partition | 1        |
| 3627  | preempted | R  | 0:20 | vserv-276        | test-partition | 1        |
| 3628  | preempted | R  | 0:20 | vserv-277        | test-partition | 1        |
| 3626  | preempted | R  | 0:27 | vserv-275        | test-partition | 1        |
| 3630  | preempter | PD | 0:00 | (Resources)      | test-partition | 5000     |
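For completeness, the listing above was produced with roughly the following squeue call (the exact format string is an assumption), and the high-priority job was submitted with a command along these lines, where job.sh stands in for our batch script:

squeue -p test-partition -O JobID,QOS,StateCompact,TimeUsed,ReasonList,Partition,Priority
sbatch --partition=test-partition --qos=preempter job.sh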
Any advice is welcome.
Regards,
Nischey Verma