[slurm-users] Preemption not working for jobs in higher priority partition

Russell Jones arjones85 at gmail.com
Fri Aug 20 14:46:34 UTC 2021


I could have sworn I had tested this before implementing it, and that it
worked as expected.

If I imagined that testing, is there a way of allowing preemption
across partitions?
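
My reading of the Slurm docs is that preempt/partition_prio is meant to
handle exactly this case: when partitions share nodes, a pending job in a
partition with a higher PriorityTier should preempt running jobs in
partitions with a lower PriorityTier. As a minimal sketch of the relevant
slurm.conf pieces (node and partition names here are placeholders):

PreemptType=preempt/partition_prio
PreemptMode=REQUEUE
# Higher tier = preemptor; PreemptMode=OFF protects its own jobs.
PartitionName=day   Nodes=node[01-21] PriorityTier=5 PreemptMode=OFF
# Lower tier = preemptable; its jobs are requeued when the day
# partition needs the nodes.
PartitionName=night Nodes=node[01-21] PriorityTier=1 PreemptMode=REQUEUE

Failing that, would submitting to both partitions at once (sbatch
--partition=day,night) be the usual workaround, so the scheduler can
start the job through whichever partition it is able to?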

On Fri, Aug 20, 2021 at 8:40 AM Brian Andrus <toomuchit at gmail.com> wrote:

> IIRC, preemption is determined by partition first, not node.
>
> Since your pending job is in the 'day' partition, it will not preempt
> something in the 'night' partition (even if the node is in both).
>
> Brian Andrus
> On 8/19/2021 2:49 PM, Russell Jones wrote:
>
> Hi all,
>
> I could use some help understanding why preemption is not working
> properly for me. I have a job blocking other jobs in a way that doesn't
> make sense to me. Any assistance is appreciated, thank you!
>
>
> I have two partitions defined in Slurm, a daytime partition and a
> nighttime partition:
>
> Day partition - PriorityTier of 5, always Up. Limited resources under
> this QOS.
> Night partition - PriorityTier of 5 during the night; during the day it
> is set to Down and its PriorityTier is changed to 1. Jobs can be
> submitted to the night queue under an unlimited QOS as long as resources
> are available.
>
> The idea is that jobs can continue to run in the night partition, even
> during the day, until resources are requested from the day partition.
> Jobs in the night partition would then be requeued/canceled to satisfy
> those requests.
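>
> For concreteness, the flip is the equivalent of running something like
> the following on a schedule (illustrative commands; State=DOWN stops new
> job starts but lets already-running jobs finish):
>
> # daytime: close the night partition and drop its priority
> scontrol update PartitionName=night State=DOWN PriorityTier=1
> # nighttime: reopen it at full priority
> scontrol update PartitionName=night State=UP PriorityTier=5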
>
>
>
> Current output of "scontrol show part" :
>
> PartitionName=day
>    AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
>    AllocNodes=ALL Default=NO QoS=part_day
>    DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
>    MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
>    Nodes=cluster-r1n[01-13],cluster-r2n[01-08]
>    PriorityJobFactor=1 PriorityTier=5 RootOnly=NO ReqResv=NO OverSubscribe=NO
>    OverTimeLimit=NONE PreemptMode=REQUEUE
>    State=UP TotalCPUs=336 TotalNodes=21 SelectTypeParameters=NONE
>    JobDefaults=(null)
>    DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
>
>
> PartitionName=night
>    AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
>    AllocNodes=ALL Default=NO QoS=part_night
>    DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
>    MaxNodes=22 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
>    Nodes=cluster-r1n[01-13],cluster-r2n[01-08]
>    PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
>    OverTimeLimit=NONE PreemptMode=REQUEUE
>    State=DOWN TotalCPUs=336 TotalNodes=21 SelectTypeParameters=NONE
>    JobDefaults=(null)
>    DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
>
>
>
>
> I currently have a job in the night partition that is blocking jobs in
> the day partition, even though the day partition has a PriorityTier of 5
> and the night partition is Down with a PriorityTier of 1.
>
> My current slurm.conf preemption settings are:
>
> PreemptMode=REQUEUE
> PreemptType=preempt/partition_prio
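>
> (To rule out a stale configuration, the live values can be confirmed on
> the controller with:
>
> scontrol show config | grep -i preempt
>
> which should echo the PreemptMode and PreemptType above.)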
>
>
>
> The blocking job's scontrol show job output is:
>
> JobId=105713 JobName=jobname
>    Priority=1986 Nice=0 Account=xxx QOS=normal
>    JobState=RUNNING Reason=None Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>    RunTime=17:49:39 TimeLimit=7-00:00:00 TimeMin=N/A
>    SubmitTime=2021-08-18T22:36:36 EligibleTime=2021-08-18T22:36:36
>    AccrueTime=2021-08-18T22:36:36
>    StartTime=2021-08-18T22:36:39 EndTime=2021-08-25T22:36:39 Deadline=N/A
>    PreemptEligibleTime=2021-08-18T22:36:39 PreemptTime=None
>    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-08-18T22:36:39
>    Partition=night AllocNode:Sid=cluster-1:1341505
>    ReqNodeList=(null) ExcNodeList=(null)
>    NodeList=cluster-r1n[12-13],cluster-r2n[04-06]
>    BatchHost=cluster-r1n12
>    NumNodes=5 NumCPUs=80 NumTasks=5 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>    TRES=cpu=80,node=5,billing=80,gres/gpu=20
>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>    MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>    Features=(null) DelayBoot=00:00:00
>    OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
>
>
>
> The job that is being blocked:
>
> JobId=105876 JobName=bash
>    Priority=2103 Nice=0 Account=xxx QOS=normal
>    JobState=PENDING Reason=Nodes_required_for_job_are_DOWN,_DRAINED_or_reserved_for_jobs_in_higher_priority_partitions Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
>    RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
>    SubmitTime=2021-08-19T16:19:23 EligibleTime=2021-08-19T16:19:23
>    AccrueTime=2021-08-19T16:19:23
>    StartTime=Unknown EndTime=Unknown Deadline=N/A
>    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-08-19T16:26:43
>    Partition=day AllocNode:Sid=cluster-1:2776451
>    ReqNodeList=(null) ExcNodeList=(null)
>    NodeList=(null)
>    NumNodes=3 NumCPUs=40 NumTasks=40 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>    TRES=cpu=40,node=1,billing=40
>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>    MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>    Features=(null) DelayBoot=00:00:00
>    OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
>
>
>
> Why is the day job not preempting the night job?
>
>