[slurm-users] Proposal for new TRES - "Processor Performance Units"....

Thu Jun 20 06:37:00 UTC 2019

On 19/06/2019 22.30, Fulcomer, Samuel wrote:
> 
> (...and yes, the name is inspired by a certain OEM's software licensing
> schemes...)
> 
> At Brown we run a ~400 node cluster containing nodes of multiple
> architectures (Sandy/Ivy, Haswell/Broadwell, and Sky/Cascade) purchased
> in some cases by University funds and in others by investigator funding
> (~50:50).  They all appear in the default SLURM partition. We have 3
> classes of SLURM users:
> 
>  1. Exploratory - no-charge access to up to 16 cores
>  2. Priority - $750/quarter for access to up to 192 cores (and with a
>     GrpTRESRunMins=cpu limit). Each user has their own QoS
>  3. Condo - an investigator group who paid for nodes added to the
>     cluster. The group has its own QoS and SLURM Account. The QoS allows
>     use of the number of cores purchased and has a much higher priority
>     than the QoS' of the "priority" users.
> 
> The first problem with this scheme is that condo users who have
> purchased the older hardware now have access to the newest without
> penalty. In addition, we're encountering resistance to the idea of
> turning off their hardware and terminating their condos (despite MOUs
> stating a 5yr life). The pushback is the stated belief that the hardware
> should run until it dies.
> 
> What I propose is a new TRES called a Processor Performance Unit (PPU)
> that would be specified on the Node line in slurm.conf. Usage counted
> against a GrpTRES=ppu=N limit would be calculated as the number of
> allocated cores multiplied by their associated PPU numbers.
> 
> We could then assign a base PPU to the oldest hardware, say, "1" for
> Sandy/Ivy, and increase it for later architectures based on performance
> improvement. We'd set the condo QoS to GrpTRES=ppu=N*X+M*Y,..., where N
> is the number of cores purchased of the oldest architecture, X is that
> architecture's configured PPU/core, and M*Y (and so on) repeats the
> product for any newer nodes/cores the investigator has purchased since.
> 
> The result is that the investigator group gets to run on an
> approximation of the performance that they've purchased, rather than on
> the raw purchased core count.
> 
> Thoughts?
> 
> 
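
To make the proposed arithmetic concrete, here is a minimal Python sketch of
the N*X+M*Y limit and of the per-job PPU charge. The PPU/core values for the
newer architectures and the purchased core counts are hypothetical; the
proposal only fixes "1" for Sandy/Ivy. The architecture keys and function
names are made up for illustration and are not Slurm identifiers.

```python
# Hypothetical PPU-per-core table. Only the base value of 1 for Sandy/Ivy
# comes from the proposal; the other two numbers are assumptions.
PPU_PER_CORE = {
    "sandy_ivy": 1.0,
    "haswell_broadwell": 1.5,
    "sky_cascade": 2.0,
}


def condo_ppu_limit(purchased_cores):
    """Return N*X + M*Y + ... for a condo group.

    purchased_cores maps an architecture name to the number of cores the
    group bought of that architecture.
    """
    return sum(n * PPU_PER_CORE[arch] for arch, n in purchased_cores.items())


def job_ppu(arch, cores):
    """PPUs charged for a job: allocated cores times PPU/core of the node."""
    return cores * PPU_PER_CORE[arch]


# A group that bought 64 Sandy/Ivy cores and later 32 Sky/Cascade cores.
limit = condo_ppu_limit({"sandy_ivy": 64, "sky_cascade": 32})
print("GrpTRES=ppu=%d" % limit)
print("PPUs for 64 Sky/Cascade cores:", job_ppu("sky_cascade", 64))
```

With these assumed numbers the group could fill its limit with 64 cores of
the newest hardware or 128 cores of the oldest, which is the "performance
purchased" behaviour the proposal is after.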

We group our nodes into separate partitions based on the CPU model.
E.g. the partition "batch-skl" is where our Skylake (6148) nodes are.
Then we have a job_submit.lua script which sends jobs without an
explicit partition spec to all batch-xxx partitions (checking
constraints etc. along the way). Then for each partition we set
TRESBillingWeights= to "normalize" the fairshare consumption based on
the geometric mean of a set of hopefully not too unrepresentative
single-node benchmarks [1].
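
As a rough illustration of the normalization step, the Python sketch below
turns per-benchmark speedups into a per-partition CPU billing weight via the
geometric mean. The partition name "batch-ivb" and all speedup numbers are
invented for the example, and the real procedure may differ in its details;
only the TRESBillingWeights= option and the geometric mean come from the
description above.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive numbers, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-core speedups of each partition relative to a reference
# partition, one entry per single-node benchmark.
speedups = {
    "batch-ivb": [1.0, 1.0, 1.0],  # reference partition (assumed name)
    "batch-skl": [2.0, 1.0, 4.0],
}

# One CPU billing weight per partition.
cpu_weight = {part: geometric_mean(vals) for part, vals in speedups.items()}

for part, weight in sorted(cpu_weight.items()):
    print('PartitionName=%s TRESBillingWeights="CPU=%.2f"' % (part, weight))
```

The geometric mean is a natural fit for averaging ratios: a 2x gain on one
benchmark and a 0.5x loss on another cancel out to 1.0, which an arithmetic
mean would not give.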

We also set a memory billing weight, and have MAX_TRES among our
PriorityFlags, approximating dominant resource fairness (DRF) [2].
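
A simplified Python model of what MAX_TRES changes, assuming one CPU weight
and one per-GB memory weight (the weights and the job size here are made up):
without the flag the weighted TRES are summed, with it the job is billed for
its largest weighted TRES. This sketch ignores global TRES such as licenses,
which Slurm handles separately, so treat it as an approximation rather than
the scheduler's exact computation.

```python
def billing(alloc, weights, max_tres=True):
    """Billable amount for one job in a simplified model.

    alloc   -- allocated amount per TRES, e.g. {"cpu": 4, "mem_gb": 64}
    weights -- billing weight per TRES; a missing weight counts as 0
    """
    weighted = [alloc[name] * weights.get(name, 0.0) for name in alloc]
    return max(weighted) if max_tres else sum(weighted)


# Hypothetical weights: 1 per core, 0.25 per GB of memory.
weights = {"cpu": 1.0, "mem_gb": 0.25}
job = {"cpu": 4, "mem_gb": 64}

print("with MAX_TRES:   ", billing(job, weights, max_tres=True))
print("without MAX_TRES:", billing(job, weights, max_tres=False))
```

Here the memory-heavy job is billed for its dominant resource (memory, 16
units) rather than for its 4 cores, which is the DRF-like behaviour.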

[1] https://github.com/AaltoScienceIT/docker-fgci-benchmark

[2] https://people.eecs.berkeley.edu/~alig/papers/drf.pdf

-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqvist at aalto.fi