[slurm-users] floating condo partition, no pre-emption, guarantee a max pend time?

Paul Edmon pedmon at cfa.harvard.edu
Thu Apr 23 13:24:06 UTC 2020


You could probably satisfy this with a combination of fairshare and 
QoS's.  You could also tier partitions: a priority partition and a 
normal partition, with a QoS on the priority partition limiting its 
maximum size.  You would naturally want to turn off preemption so 
that only priority governs.
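
A minimal sketch of that layout, assuming two partitions sharing the 
same nodes (the node range, partition names, and QoS name are all 
placeholders for illustration):

```shell
# slurm.conf (illustrative): both partitions span the same nodes.
# The higher PriorityTier means pending "priority" jobs are considered
# for scheduling first, with no preemption involved.
PreemptType=preempt/none
PartitionName=priority Nodes=node[01-40] PriorityTier=10 QOS=pi_cap Default=NO State=UP
PartitionName=normal Nodes=node[01-40] PriorityTier=1 Default=YES State=UP
```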

In our setup we generally just rely on fairshare to adjudicate and let 
things float.  That said, we do have a lot of legacy partitions that we 
are trying to ditch to get rid of fragmentation, but it's a slow process 
as PIs like to hug their hardware.  In our ideal setup, everything would 
be governed purely by fairshare, with one large queue and no QoS's.

For your setup though I think a combination of QoS's and partition 
layout would fit the bill.
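
Concretely, the size-limiting QoS attached to the priority partition 
might be created along these lines (the QoS name is a placeholder, and 
the 80-core figure is taken from the quoted question; exact syntax may 
vary by Slurm version):

```shell
# create a QoS that caps aggregate usage at 80 cores; attach it to
# the priority partition via QOS=pi_cap in its slurm.conf definition
sacctmgr add qos pi_cap set GrpTRES=cpu=80
```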

-Paul Edmon-

On 4/22/2020 5:43 PM, Paul Brunk wrote:
> Hi all:
>
> [ BTW this is the same situation that the submitter of https://bugs.schedmd.com/show_bug.cgi?id=2692 presented. ]
>
> We have a non-Slurm cluster in production and are developing our next one, which will run Slurm 20.02.X.
>
> We have a partition "batch" which is open to all users.  Half of the nodes are 'ownerless', while some PIs have bought nodes.  In production now, there's a distinct partition for each such PI, and her physical nodes are allocated to her partition only.
>
> But for the Slurm cluster, we want to add the ability to have PIs buy prioritized resource allocations, rather than physical nodes.  If a PI contributed 20 nodes' worth of money (80 cores' worth, let's say), then we want it such that
>
> (a) until either (PI has no small-enough jobs pending) or (PI is using 80
>      batch-partition cores), idle batch-partition cores are allocated
>      to this PI's jobs first.
>
> (b) until the PI is using 80 batch-partition cores, her pending jobs
>      small enough to fit inside the unused-by-this-PI subset of that
>      80-core set will have to wait no more than 2 hours, say.
>
> (c) the "batch" partition will have a max runtime longer than the 2hrs
>      max pend time stated in the PI's SLA.  Many "batch" jobs are < 2
>      hrs though.
>
> (d) we don't pre-empt (since we don't do that here).
>
> Defining a floating partition with GrpCores = 80, giving it very high priority, and assigning the "batch" partition's cores to it would do much of what we want, but it wouldn't satisfy the "within two hours" requirement, because of the "batch" partition's max runtime.
>
> Does anyone know of a way to satisfy all of (a)-(d)?
>
> As in the original posting, my thinking has only yielded this:  a floating-through-time 2-hr reservation on N cores would ensure their availability within 2 hrs.  But I'd need to automate two things: making those reserved cores uniquely available to that PI immediately upon removal of the floating reservation, and managing the reservation's node membership.  I don't assume that a good answer resembles that at all.
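>
> For the record, a floating reservation of that kind can be expressed with the TIME_FLOAT flag, roughly as follows (the reservation name, user, and core count here are placeholders):
>
>     # the start time stays 2 hours ahead of "now" for as long as the
>     # reservation exists, so 80 cores are always free within 2 hrs
>     scontrol create reservation ReservationName=pi_float Users=pismith \
>         StartTime=now+2hours Duration=UNLIMITED Flags=TIME_FLOAT CoreCnt=80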
>
> Thanks for any insights!
>
> --
> Paul Brunk, system administrator
> Georgia Advanced Computing Resource Center (GACRC)
> Enterprise IT Svcs, the University of Georgia
>


