[slurm-users] Preemption not working for jobs in higher priority partition

Russell Jones arjones85 at gmail.com
Thu Aug 19 21:49:05 UTC 2021


Hi all,

I could use some help understanding why preemption is not working properly
for me. A job is blocking other jobs in a way that doesn't make sense to me.
Any assistance is appreciated, thank you!


I have two partitions defined in Slurm, a day-time and a night-time
partition:

Day partition - PriorityTier of 5, always Up. Limited resources under this
QOS.
Night partition - PriorityTier of 5 during night time, during day time set
to Down and PriorityTier changed to 1. Jobs can be submitted to night queue
for an unlimited QOS as long as resources are available.

The idea is that jobs can continue to run in the night partition, even
during the day time, until resources are requested from the day partition.
Jobs in the night partition would then be requeued/canceled to satisfy
those requests.
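
As a rough sketch, the day/night switch described above could be expressed
with scontrol as follows. The function names and the SCONTROL variable are
made up for illustration, and how and when the switch is triggered (for
example from cron) is an assumption. By default the commands are only
printed, not executed.

```shell
#!/bin/sh
# Illustrative only: helper names and the SCONTROL variable are hypothetical.
# Dry-run by default: the commands are printed instead of executed.
# Set SCONTROL=scontrol to apply them on a real cluster.
SCONTROL="${SCONTROL:-echo scontrol}"

# Day time: close the night partition to new starts and lower its tier,
# so that the day partition (PriorityTier=5) outranks it.
night_partition_day_mode() {
    $SCONTROL update PartitionName=night State=DOWN PriorityTier=1
}

# Night time: reopen the night partition at the same tier as day.
night_partition_night_mode() {
    $SCONTROL update PartitionName=night State=UP PriorityTier=5
}
```

Calling night_partition_day_mode in the morning and
night_partition_night_mode in the evening would reproduce the partition
states shown in the "scontrol show part" output below.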



Current output of "scontrol show part" :

PartitionName=day
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=part_day
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=cluster-r1n[01-13],cluster-r2n[01-08]
   PriorityJobFactor=1 PriorityTier=5 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=REQUEUE
   State=UP TotalCPUs=336 TotalNodes=21 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED


PartitionName=night
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=part_night
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=22 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=cluster-r1n[01-13],cluster-r2n[01-08]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=REQUEUE
   State=DOWN TotalCPUs=336 TotalNodes=21 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED




I currently have a job in the night partition that is blocking jobs in the
day partition, even though the day partition has a PriorityTier of 5, and
night partition is Down with a PriorityTier of 1.

My current slurm.conf preemption settings are:

PreemptMode=REQUEUE
PreemptType=preempt/partition_prio



The blocking job's scontrol show job output is:

JobId=105713 JobName=jobname
   Priority=1986 Nice=0 Account=xxx QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=17:49:39 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2021-08-18T22:36:36 EligibleTime=2021-08-18T22:36:36
   AccrueTime=2021-08-18T22:36:36
   StartTime=2021-08-18T22:36:39 EndTime=2021-08-25T22:36:39 Deadline=N/A
   PreemptEligibleTime=2021-08-18T22:36:39 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-08-18T22:36:39
   Partition=night AllocNode:Sid=cluster-1:1341505
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=cluster-r1n[12-13],cluster-r2n[04-06]
   BatchHost=cluster-r1n12
   NumNodes=5 NumCPUs=80 NumTasks=5 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=80,node=5,billing=80,gres/gpu=20
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)



The job that is being blocked:

JobId=105876 JobName=bash
   Priority=2103 Nice=0 Account=xxx QOS=normal
   JobState=PENDING Reason=Nodes_required_for_job_are_DOWN,_DRAINED_or_reserved_for_jobs_in_higher_priority_partitions Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2021-08-19T16:19:23 EligibleTime=2021-08-19T16:19:23
   AccrueTime=2021-08-19T16:19:23
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-08-19T16:26:43
   Partition=day AllocNode:Sid=cluster-1:2776451
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=3 NumCPUs=40 NumTasks=40 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=40,node=1,billing=40
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)



Why is the day job not preempting the night job?