[slurm-users] Heterogeneous Memory, Partition limits, node preference and backfill.

Ruffner, Scott (jpr9c) jpr9c at virginia.edu
Wed Apr 7 19:08:42 UTC 2021

Hi Everyone,

We have a challenge with scheduling jobs in a partition composed of nodes that are heterogeneous with respect to memory and cores [1]. We also use cores as the unit of measure for charging users, and currently we implement a crude mechanism of using MaxMemPerCore as a proxy for memory use, to charge for memory. In the partition in question, we have nodes with 256GB, 384GB, and 768GB of RAM. The 384GB and 256GB nodes have different core counts, but both work out to roughly ~9GB/core; the 768GB nodes are roughly ~18GB/core. The default memory request for the partition is set to this same ~9GB/core amount and will remain unchanged. This partition is really for HTC, so the max node limit is set to 2 and will remain there (the parallel partition is homogeneous).
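For concreteness, a sketch of what this looks like in slurm.conf — node names, core counts, and the partition name here are hypothetical, not our actual config (Slurm's parameter is spelled MaxMemPerCPU, and RealMemory is in MB):

```
# Hypothetical sketch of the partition described above.
NodeName=htc[001-010] CPUs=28 RealMemory=256000   # ~9 GB/core
NodeName=htc[011-020] CPUs=40 RealMemory=384000   # ~9 GB/core
NodeName=htc[021-024] CPUs=40 RealMemory=768000   # ~18 GB/core
PartitionName=htc Nodes=htc[001-024] MaxNodes=2 DefMemPerCPU=9000 MaxMemPerCPU=9000
```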

So, if we increase that MaxMemPerCore number, we'll potentially have a lot of nodes with un-schedulable cores (no memory left) and if we leave it where it is, the extra 384GB in the larger nodes won't ever get used. Of these two, the former is preferable, even though the charge for memory is effectively halved (that's fine, most allocations are monopoly money anyway). We really just want to optimize job placement for throughput without having to create a separate partition.
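A back-of-the-envelope check of the trade-off, using illustrative round numbers for node sizes (not our exact hardware): raising the per-core limit strands cores on the smaller nodes, while leaving it low strands memory on the big ones.

```python
# Illustrative arithmetic only; node sizes are round numbers, not exact hardware.
def usable_cores(total_mem_gb, cores, mem_per_core_gb):
    """Cores still schedulable when every core implicitly reserves
    mem_per_core_gb of RAM (the MaxMemPerCore-as-proxy scheme)."""
    return min(cores, total_mem_gb // mem_per_core_gb)

# Raising the limit to ~18 GB/core strands cores on the smaller nodes:
print(usable_cores(256, 28, 18))  # 14 of 28 cores schedulable
print(usable_cores(384, 40, 18))  # 21 of 40 cores schedulable
# Leaving it at ~9 GB/core instead leaves memory idle on the 768GB nodes:
print(usable_cores(768, 40, 9))   # all 40 cores, but ~400 GB never allocatable
```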

What we're concerned about is this: we don't believe the scheduler will be smart about job placement - placing larger-memory jobs preferentially on nodes with more total memory and smaller-memory jobs on the smaller-memory nodes. To address this, we're thinking of just weighting the nodes so that the smaller-memory nodes are preferred (in Slurm terms, giving them a lower Weight, since the scheduler allocates the lowest-weight eligible nodes first). Jobs would get placed there first, and only get bumped to the larger-memory nodes when there are larger memory requests or when the smaller nodes are already full.
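The weighting idea as a slurm.conf fragment (again with hypothetical node names and weight values): Slurm fills the lowest-Weight nodes that satisfy a job first, so the smaller-memory nodes get the lower weights.

```
# Hypothetical weights; lower Weight = allocated first.
NodeName=htc[001-010] CPUs=28 RealMemory=256000 Weight=10
NodeName=htc[011-020] CPUs=40 RealMemory=384000 Weight=20
NodeName=htc[021-024] CPUs=40 RealMemory=768000 Weight=50
```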

We'd also like this scheme to limit backfill of small jobs onto the larger nodes. Ideally, if we can get this to work, we'd extend it by getting rid of the "largemem" (1-3TB nodes) partition and putting those nodes into this single partition (many of our largemem users could easily fit individual jobs in <768GB). I have had good results on a small cluster of very heterogeneous nodes all in one large partition, just letting the scheduler handle things; it worked reasonably well, with the exception of some very large (bordering on --exclusive, or explicitly --exclusive) jobs starving because of small-job backfill.

Has anyone (everyone?) tried to deal with this? We're going to go ahead and try out this scheme (it seems pretty straightforward), but I wanted to get a sense of what other installations are doing.


Scott Ruffner
University of Virginia Research Computing

[1] Our cluster grows sort of organically as we have an unpredictable budget, and can't plan for forklift replacement of partitions (nodes) on regular lifecycle periods.

