[slurm-users] Preemption not working for jobs in higher priority partition
Russell Jones
arjones85 at gmail.com
Thu Aug 19 21:49:05 UTC 2021
Hi all,
I could use some help understanding why preemption is not working
properly for me. A job in a lower-priority partition is blocking jobs in a
higher-priority partition, and I can't see why.
Any assistance is appreciated, thank you!
I have two partitions defined in Slurm, a daytime and a nighttime
partition:
Day partition - PriorityTier of 5, always Up. Limited resources under this
QOS.
Night partition - PriorityTier of 5 at night; during the day it is set to
Down and its PriorityTier is changed to 1. Jobs can be submitted to the
night queue under an unlimited QOS as long as resources are available.
The idea is that jobs can keep running in the night partition, even
during the day, until resources are requested from the day partition.
Jobs in the night partition would then be requeued/canceled to
satisfy those requests.
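For readers following along: the post does not say how the night partition is flipped between its two states. A minimal sketch, assuming the switch is done by a scheduled script that calls "scontrol update" (the function name and the scheduling itself are hypothetical, not taken from my setup), might look like this:

```python
def night_partition_command(is_night, partition="night"):
    """Return the scontrol argv for the night partition's current phase.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    if is_night:
        # Night: partition is Up and sits in the same tier as "day".
        settings = ["State=UP", "PriorityTier=5"]
    else:
        # Day: queued jobs are not started, running jobs keep running,
        # and the partition drops to a lower tier than "day".
        settings = ["State=DOWN", "PriorityTier=1"]
    return ["scontrol", "update", f"PartitionName={partition}"] + settings


print(" ".join(night_partition_command(is_night=False)))
```

The returned list could be handed to subprocess.run() from cron or a systemd timer; that wiring is left out here because it depends on the site.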
Current output of "scontrol show part" :
PartitionName=day
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=part_day
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=cluster-r1n[01-13],cluster-r2n[01-08]
PriorityJobFactor=1 PriorityTier=5 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=REQUEUE
State=UP TotalCPUs=336 TotalNodes=21 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=night
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=part_night
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=22 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=cluster-r1n[01-13],cluster-r2n[01-08]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=REQUEUE
State=DOWN TotalCPUs=336 TotalNodes=21 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
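To make the fields that matter for preemption easier to compare, here is a small sketch that pulls PriorityTier, PreemptMode and State out of output like the above. The parser is a hypothetical helper, not part of Slurm, and it assumes no value contains whitespace (true for the lines shown here, not in general):

```python
def parse_scontrol_records(text, key="PartitionName"):
    """Group whitespace-separated Key=Value tokens into one dict per record.

    A new record starts at every token whose key equals `key`.
    Simplification: values containing spaces are not handled.
    """
    records = {}
    current = None
    for token in text.split():
        if "=" not in token:
            continue
        name, value = token.split("=", 1)
        if name == key:
            current = records.setdefault(value, {})
        elif current is not None:
            current[name] = value
    return records


# Excerpt of the "scontrol show part" output quoted in this post.
SAMPLE = """
PartitionName=day
PriorityJobFactor=1 PriorityTier=5 RootOnly=NO ReqResv=NO
OverTimeLimit=NONE PreemptMode=REQUEUE
State=UP TotalCPUs=336 TotalNodes=21 SelectTypeParameters=NONE
PartitionName=night
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO
OverTimeLimit=NONE PreemptMode=REQUEUE
State=DOWN TotalCPUs=336 TotalNodes=21 SelectTypeParameters=NONE
"""

parts = parse_scontrol_records(SAMPLE)
for name, fields in parts.items():
    print(name, fields["PriorityTier"], fields["PreemptMode"], fields["State"])
```

For the excerpt above this prints "day 5 REQUEUE UP" and "night 1 REQUEUE DOWN", i.e. the day partition is in the higher tier and both partitions report PreemptMode=REQUEUE.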
I currently have a job in the night partition that is blocking jobs in the
day partition, even though the day partition has a PriorityTier of 5, and
night partition is Down with a PriorityTier of 1.
My current slurm.conf preemption settings are:
PreemptMode=REQUEUE
PreemptType=preempt/partition_prio
The blocking job's scontrol show job output is:
JobId=105713 JobName=jobname
Priority=1986 Nice=0 Account=xxx QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=17:49:39 TimeLimit=7-00:00:00 TimeMin=N/A
SubmitTime=2021-08-18T22:36:36 EligibleTime=2021-08-18T22:36:36
AccrueTime=2021-08-18T22:36:36
StartTime=2021-08-18T22:36:39 EndTime=2021-08-25T22:36:39 Deadline=N/A
PreemptEligibleTime=2021-08-18T22:36:39 PreemptTime=None
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-08-18T22:36:39
Partition=night AllocNode:Sid=cluster-1:1341505
ReqNodeList=(null) ExcNodeList=(null)
NodeList=cluster-r1n[12-13],cluster-r2n[04-06]
BatchHost=cluster-r1n12
NumNodes=5 NumCPUs=80 NumTasks=5 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=80,node=5,billing=80,gres/gpu=20
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
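As a sanity check that the blocking job really sits on nodes belonging to the day partition, the two node lists can be expanded and compared. On a cluster, "scontrol show hostnames" does the expansion; the function below is a simplified, hypothetical stand-in that only understands the prefix[a-b,c] form used in this post:

```python
import re


def expand_hostlist(expr):
    """Expand a simple Slurm hostlist such as 'n[01-03],m[5,7]'.

    Simplification: one bracket group per name, no suffix after the bracket.
    """
    hosts = []
    # Each item is a name optionally followed by one [...] group.
    for item in re.findall(r"[^,\[]+(?:\[[^\]]*\])?", expr):
        m = re.fullmatch(r"([^\[]+)\[([^\]]*)\]", item)
        if not m:
            hosts.append(item)
            continue
        prefix, ranges = m.groups()
        for part in ranges.split(","):
            lo, _, hi = part.partition("-")
            hi = hi or lo
            width = len(lo)  # keep zero padding, e.g. 04 -> "04"
            for i in range(int(lo), int(hi) + 1):
                hosts.append(f"{prefix}{i:0{width}d}")
    return hosts


day_nodes = expand_hostlist("cluster-r1n[01-13],cluster-r2n[01-08]")
job_nodes = expand_hostlist("cluster-r1n[12-13],cluster-r2n[04-06]")
print(len(day_nodes), len(job_nodes), set(job_nodes) <= set(day_nodes))
```

With the node lists from this post, the counts come out to 21 and 5, matching TotalNodes=21 and NumNodes=5, and every node held by job 105713 is also a node of the day partition.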
The job that is being blocked:
JobId=105876 JobName=bash
Priority=2103 Nice=0 Account=xxx QOS=normal
JobState=PENDING
Reason=Nodes_required_for_job_are_DOWN,_DRAINED_or_reserved_for_jobs_in_higher_priority_partitions
Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2021-08-19T16:19:23 EligibleTime=2021-08-19T16:19:23
AccrueTime=2021-08-19T16:19:23
StartTime=Unknown EndTime=Unknown Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-08-19T16:26:43
Partition=day AllocNode:Sid=cluster-1:2776451
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=3 NumCPUs=40 NumTasks=40 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=40,node=1,billing=40
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Why is the day job not preempting the night job?