[slurm-users] hi-priority partition and preemption

Reed Dier reed.dier at focusvq.com
Thu May 25 15:55:11 UTC 2023


After trying to approach this with preempt/partition_prio, we ended up moving to QOS-based preemption, both because of some issues with suspend/requeue and because QOS allows quicker and easier tweaks than changing partitions as a whole.

> PreemptType=preempt/qos
> PreemptMode=SUSPEND,GANG
> PartitionName=part-lopri     Nodes=nodes[000-NNN]    Default=NO          MaxTime=INFINITE    OverSubscribe=FORCE:1   PriorityTier=10     State=UP
> PartitionName=part-hipri     Nodes=nodes[000-NNN]    Default=NO          MaxTime=INFINITE    OverSubscribe=NO        PriorityTier=100    State=UP        PreemptMode=OFF

We then have a few QOSes with different Priority values, PreemptModes, lists of QOSes they can preempt, etc. (a creation sketch follows after the table):
>       Name   Priority    Preempt PreemptMode
> ---------- ---------- ---------- -----------
>         rq         10                requeue
>       susp         11                suspend 
>      hipri        100    rq,susp     cluster
>       test         50         rq     requeue

The rq QOS is for stateless jobs that can be requeued; the susp QOS is for stateful jobs that need to be suspended.
hipri can preempt both rq and susp.
We also have a test QOS with very strict limits (wall clock, job count, TRES count, etc.) that lets small jobs jump the queue for quick testing before they are submitted into the full queue.
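
The exact limits on the test QOS aren't shown here; as an illustration, they would be set along these lines (values purely hypothetical):

> sacctmgr modify qos test set MaxWall=00:30:00 MaxJobsPerUser=4 MaxTRESPerUser=cpu=16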

The tricky part for us was that we have some stateful jobs that need to be suspended and some stateless jobs that can simply be requeued without issue.
But we wanted the hipri partition to take precedence on the same hardware pool.
We also didn’t want gang scheduling to flip-flop between running jobs, which, if memory serves, is why we ended up with duplicative partitions for priority purposes: we couldn’t get preemption to work correctly intra-partition.
In a perfect world we would have a single partition with everything handled in QOS, but it’s working, and that’s what matters.
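
With this layout, the partition/QOS pairing is chosen at submit time, something like (script names hypothetical):

> sbatch --partition=part-lopri --qos=susp stateful_job.sh
> sbatch --partition=part-lopri --qos=rq stateless_job.sh
> sbatch --partition=part-hipri --qos=hipri urgent_job.sh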

I’m not sure how any of this would work with FORCE:20 oversubscribe, but hopefully it offers something useful to try next.

Reed

> On May 24, 2023, at 8:42 AM, Groner, Rob <rug262 at psu.edu> wrote:
> 
> What you are describing is definitely doable.  We have our system set up similarly.  All nodes are in the "open" partition and "prio" partition, but a job submitted to the "prio" partition will preempt the open jobs.
> 
> I don't see anything clearly wrong with your slurm.conf settings.  Ours are very similar, though we use only FORCE:1 for oversubscribe.  You might try that just to see if there's a difference.
> 
> What are the sbatch settings you are using when you submit the jobs?
> 
> Do you have PreemptExemptTime set to anything in slurm.conf?
> 
> What is the reason squeue gives for the high priority jobs to be pending?
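> 
> A quick check (a sketch; the format string is just one of many options, %R prints the pending reason):
> 
>     squeue -p hi-pri -o "%.10i %.8T %.30R"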
> 
> For your "run regularly" goal, you might consider scrontab.  Once the priority and preemption questions are figured out, that will start the job at a regular time.
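> 
> A minimal scrontab entry might look like this (times and script path hypothetical; edit with scrontab -e):
> 
>     #SCRON --partition=hi-pri --time=30
>     0 6 * * * /path/to/job.sh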
> 
> Rob
> 
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Fabrizio Roccato <f.roccato at isac.cnr.it>
> Sent: Wednesday, May 24, 2023 7:17 AM
> To: slurm-users at lists.schedmd.com
> Subject: [slurm-users] hi-priority partition and preemption
>  
> Hi all,
>         I'm trying to have two overlapping partitions, say normal and hi-pri,
> so that when jobs are launched in the second one they can preempt the jobs
> already running in the first one, automatically putting them in the suspend
> state. After completion, the jobs in the normal partition must be
> automatically resumed.
> 
> here are my (relevant) slurm.conf settings:
> 
> > PreemptMode=suspend,gang
> > PreemptType=preempt/partition_prio
> >
> > PartitionName=normal Nodes=node0[01-08] MaxTime=1800 PriorityTier=100 AllowAccounts=group1,group2 OverSubscribe=FORCE:20 PreemptMode=suspend
> > PartitionName=hi-pri Nodes=node0[01-08] MaxTime=360 PriorityTier=500 AllowAccounts=group2 OverSubscribe=FORCE:20 PreemptMode=off
> 
> With these settings, however, jobs in the hi-pri partition were put in PD state
> while the ones already running in the normal partition continued in their R status.
> What am I doing wrong? What am I missing?
> 
> Since I have jobs that must run at specific times and must have priority
> over all others, is this the correct way to do it?
> 
> 
> Thanks
> 
> FR
