[slurm-users] floating condo partition, no pre-emption, guarantee a max pend time?

Matt Jay mattjay at uw.edu
Mon Apr 27 18:22:44 UTC 2020


Paul,

I saw your message, and while I don't have a specific suggestion for your overall situation off the top of my head, I did want to point out a pitfall our site discovered early on in implementing our condo-model cluster, one which to my knowledge still exists:

Specifically (see https://bugs.schedmd.com/show_bug.cgi?id=3881):
(by design)... "If there are _any_ jobs pending (regardless of the reason for the job still pending) in a partition with a higher Priority, no jobs from a lower Priority will be launched on nodes that are shared in common."

So if a PI partition covers a large swath of nodes (nodes that also exist in your 'batch' partition), and ANY jobs are pending in that PI partition for any reason (say, waiting on the CPU limit), then no jobs will be launched in the 'batch' partition on nodes which are also in the PI partition (assuming the PI partition has higher priority).
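To make that concrete, here's a rough sketch of the shape of the problem (partition and node names are made up, not our actual config):

    # slurm.conf excerpt: the two partitions overlap on node[001-020]
    PartitionName=pi_big Nodes=node[001-020] PriorityTier=10 State=UP
    PartitionName=batch  Nodes=node[001-040] PriorityTier=1  State=UP

With that layout, if any pi_big job is pending for any reason (e.g. held back by a QOS limit), no batch job will start on node[001-020], even when those nodes are sitting idle.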

We had to adapt the model we had planned, using many smaller (in node count) higher-priority 'condo' partitions instead of a few large ones -- still not ideal, but in practice it works okay for us.
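In practice that meant something shaped like this (again, hypothetical names):

    # many small condo partitions instead of one big one
    PartitionName=pi_a  Nodes=node[001-004] PriorityTier=10 State=UP
    PartitionName=pi_b  Nodes=node[005-008] PriorityTier=10 State=UP
    PartitionName=batch Nodes=node[001-040] PriorityTier=1  State=UP

That way a pending job in one PI's partition only blocks batch jobs on that PI's small slice of nodes, rather than across the whole cluster.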

Matt Jay
Sr. HPC Systems Engineer - Hyak
Research Computing
University of Washington IT

-----Original Message-----
From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of Paul Brunk
Sent: Wednesday, April 22, 2020 2:44 PM
To: slurm-users at lists.schedmd.com
Subject: [slurm-users] floating condo partition, no pre-emption, guarantee a max pend time?

Hi all:

[ BTW this is the same situation that the submitter of https://bugs.schedmd.com/show_bug.cgi?id=2692 presented. ]

We have a non-Slurm cluster in production and are developing our next one, which will run Slurm 20.02.X.

We have a partition "batch" which is open to all users.  Half of the nodes are 'ownerless', while some PIs have bought nodes.  In production now, there's a distinct partition for each such PI, and her physical nodes are allocated to her partition only.

But for the Slurm cluster, we want to add the ability to have PIs buy prioritized resource allocations, rather than physical nodes.  If a PI contributed 20 nodes' worth of money (80 cores' worth, let's say), then we want it such that

(a) until either (PI has no small-enough jobs pending) or (PI is using 80
    batch-partition cores), idle batch-partition cores are allocated
    to this PI's jobs first.

(b) until the PI is using 80 batch-partition cores, her pending jobs
    small enough to fit inside the unused-by-this-PI subset of that
    80-core set will have to wait no more than 2 hours, say.

(c) the "batch" partition will have a max runtime longer than the 2hrs
    max pend time stated in the PI's SLA.  Many "batch" jobs are < 2
    hrs though.

(d) we don't pre-empt (since we don't do that here).

Defining a floating partition with GrpCores=80, giving it very high priority, and assigning the "batch" partition's cores to it would do much of what we want, but wouldn't provide the "within two hours" part, because of the "batch" partition's max runtime.
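For concreteness, the kind of thing I mean is roughly this (a sketch only; the names, and my use of a partition QOS with GrpTRES to get the 80-core cap, are assumptions on my part):

    # QOS capping the PI at 80 cores total across her jobs
    sacctmgr add qos pi_smith_qos set GrpTRES=cpu=80

    # slurm.conf: high-priority "floating" partition over the batch nodes
    PartitionName=pi_smith Nodes=node[001-100] PriorityTier=10 QOS=pi_smith_qos AllowAccounts=pi_smith MaxTime=7-00:00:00
    PartitionName=batch    Nodes=node[001-100] PriorityTier=1  MaxTime=7-00:00:00

I think this covers (a) and (d), but nothing in it bounds the PI's pend time for (b): a batch job already running on a node can hold it for up to the batch MaxTime.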

Does anyone know of a way to satisfy all of (a)-(d)?

As in the original posting, my thinking has only yielded this: a floating-through-time 2-hr reservation on N cores would ensure their availability within 2 hrs. But I'd need to somehow automate making those reserved cores available to that PI alone, immediately upon removal of the floating reservation, and also the management of the reservation's node membership. I don't assume that a good answer resembles that at all.
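If it helps clarify, the floating reservation I'm imagining would look something like this (hypothetical names; I'm assuming the TIME_FLOAT flag, which keeps the start time sliding forward, is the right mechanism):

    # reservation whose start time perpetually floats 2 hours ahead,
    # so nothing longer than 2 hours can be scheduled on its cores
    scontrol create reservation ReservationName=pi_smith_float \
        StartTime=now+2hours Duration=UNLIMITED Flags=TIME_FLOAT \
        Users=pismith CoreCnt=80

The part I don't see how to do cleanly is the hand-off: deleting that reservation and immediately giving those cores to that PI exclusively.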

Thanks for any insights!

--
Paul Brunk, system administrator
Georgia Advanced Computing Resource Center (GACRC)
Enterprise IT Svcs, the University of Georgia



