[slurm-users] Preemption not working for jobs in higher priority partition
Russell Jones
arjones85 at gmail.com
Fri Aug 20 14:46:34 UTC 2021
I could have sworn I had tested this before implementing it, and that it
worked as expected.
If I only imagined that testing, is there a way to allow preemption
across partitions?
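
For reference, here is a minimal sketch of the rule I expected
preempt/partition_prio to apply, as I read the slurm.conf documentation:
a job in a partition with a higher PriorityTier may preempt a job in a
partition with a lower one. The helper names (parse_partitions,
may_preempt) are made up for illustration and are not part of Slurm, and
the check is deliberately simplified. The real scheduler also takes QOS
limits, GRES, node state and more into account.

```python
def parse_partitions(text):
    """Parse `scontrol show part`-style output into {name: {key: value}}."""
    parts = {}
    current = None
    for token in text.split():
        if "=" not in token:
            continue
        key, value = token.split("=", 1)
        if key == "PartitionName":
            # A new partition record starts here.
            current = parts.setdefault(value, {})
        elif current is not None:
            current[key] = value
    return parts


def may_preempt(parts, preemptor, preemptee):
    """Simplified assumption: strictly higher PriorityTier wins,
    provided the victim partition's PreemptMode is not OFF."""
    high = int(parts[preemptor]["PriorityTier"])
    low = int(parts[preemptee]["PriorityTier"])
    mode = parts[preemptee].get("PreemptMode", "OFF")
    return high > low and mode != "OFF"


# Trimmed copy of the partition output quoted below.
SAMPLE = """
PartitionName=day
   PriorityJobFactor=1 PriorityTier=5 RootOnly=NO ReqResv=NO
   OverTimeLimit=NONE PreemptMode=REQUEUE
   State=UP TotalCPUs=336 TotalNodes=21
PartitionName=night
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO
   OverTimeLimit=NONE PreemptMode=REQUEUE
   State=DOWN TotalCPUs=336 TotalNodes=21
"""

parts = parse_partitions(SAMPLE)
print(may_preempt(parts, "day", "night"))   # True
print(may_preempt(parts, "night", "day"))   # False
```

By that reading, the day job should be allowed to requeue the night job,
which is why the pending state surprises me.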
On Fri, Aug 20, 2021 at 8:40 AM Brian Andrus <toomuchit at gmail.com> wrote:
> IIRC, Preemption is determined by partition first, not node.
>
> Since your pending job is in the 'day' partition, it will not preempt
> something in the 'night' partition (even if the node is in both).
>
> Brian Andrus
> On 8/19/2021 2:49 PM, Russell Jones wrote:
>
> Hi all,
>
> I could use some help understanding why preemption is not working
> properly for me. A job is blocking other jobs in a way that doesn't make
> sense to me. Any assistance is appreciated, thank you!
>
>
> I have two partitions defined in Slurm, a daytime and a nighttime
> partition:
>
> Day partition - PriorityTier of 5, always Up. Limited resources under this
> QOS.
> Night partition - PriorityTier of 5 at night; during the day it is set
> to Down and its PriorityTier is changed to 1. Jobs can be submitted to the
> night queue under an unlimited QOS as long as resources are available.
>
> The thought here is jobs can continue to run in the night partition, even
> during the day time, until resources are requested from the day partition.
> Jobs would then be requeued/canceled in the night partition to
> satisfy those requirements.
>
>
>
> Current output of "scontrol show part" :
>
> PartitionName=day
> AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
> AllocNodes=ALL Default=NO QoS=part_day
> DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0
> Hidden=NO
> MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO
> MaxCPUsPerNode=UNLIMITED
> Nodes=cluster-r1n[01-13],cluster-r2n[01-08]
> PriorityJobFactor=1 PriorityTier=5 RootOnly=NO ReqResv=NO
> OverSubscribe=NO
> OverTimeLimit=NONE PreemptMode=REQUEUE
> State=UP TotalCPUs=336 TotalNodes=21 SelectTypeParameters=NONE
> JobDefaults=(null)
> DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
>
>
> PartitionName=night
> AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
> AllocNodes=ALL Default=NO QoS=part_night
> DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0
> Hidden=NO
> MaxNodes=22 MaxTime=7-00:00:00 MinNodes=0 LLN=NO
> MaxCPUsPerNode=UNLIMITED
> Nodes=cluster-r1n[01-13],cluster-r2n[01-08]
> PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO
> OverSubscribe=NO
> OverTimeLimit=NONE PreemptMode=REQUEUE
> State=DOWN TotalCPUs=336 TotalNodes=21 SelectTypeParameters=NONE
> JobDefaults=(null)
> DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
>
>
>
>
> I currently have a job in the night partition that is blocking jobs in the
> day partition, even though the day partition has a PriorityTier of 5, and
> night partition is Down with a PriorityTier of 1.
>
> My current slurm.conf preemption settings are:
>
> PreemptMode=REQUEUE
> PreemptType=preempt/partition_prio
>
>
>
> The blocking job's scontrol show job output is:
>
> JobId=105713 JobName=jobname
> Priority=1986 Nice=0 Account=xxx QOS=normal
> JobState=RUNNING Reason=None Dependency=(null)
> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
> RunTime=17:49:39 TimeLimit=7-00:00:00 TimeMin=N/A
> SubmitTime=2021-08-18T22:36:36 EligibleTime=2021-08-18T22:36:36
> AccrueTime=2021-08-18T22:36:36
> StartTime=2021-08-18T22:36:39 EndTime=2021-08-25T22:36:39 Deadline=N/A
> PreemptEligibleTime=2021-08-18T22:36:39 PreemptTime=None
> SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-08-18T22:36:39
> Partition=night AllocNode:Sid=cluster-1:1341505
> ReqNodeList=(null) ExcNodeList=(null)
> NodeList=cluster-r1n[12-13],cluster-r2n[04-06]
> BatchHost=cluster-r1n12
> NumNodes=5 NumCPUs=80 NumTasks=5 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> TRES=cpu=80,node=5,billing=80,gres/gpu=20
> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
> Features=(null) DelayBoot=00:00:00
> OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
>
>
>
> The job that is being blocked:
>
> JobId=105876 JobName=bash
> Priority=2103 Nice=0 Account=xxx QOS=normal
> JobState=PENDING
> Reason=Nodes_required_for_job_are_DOWN,_DRAINED_or_reserved_for_jobs_in_higher_priority_partitions
> Dependency=(null)
> Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
> RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
> SubmitTime=2021-08-19T16:19:23 EligibleTime=2021-08-19T16:19:23
> AccrueTime=2021-08-19T16:19:23
> StartTime=Unknown EndTime=Unknown Deadline=N/A
> SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-08-19T16:26:43
> Partition=day AllocNode:Sid=cluster-1:2776451
> ReqNodeList=(null) ExcNodeList=(null)
> NodeList=(null)
> NumNodes=3 NumCPUs=40 NumTasks=40 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> TRES=cpu=40,node=1,billing=40
> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
> Features=(null) DelayBoot=00:00:00
> OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
>
>
>
> Why is the day job not preempting the night job?
>
>