[slurm-users] Proposal for new TRES - "Processor Performance Units"....

Fulcomer, Samuel samuel_fulcomer at brown.edu
Wed Jun 19 19:30:04 UTC 2019


(...and yes, the name is inspired by a certain OEM's software licensing
schemes...)

At Brown we run a ~400 node cluster containing nodes of multiple
architectures (Sandy/Ivy, Haswell/Broadwell, and Sky/Cascade) purchased in
some cases by University funds and in others by investigator funding
(~50:50).  They all appear in the default SLURM partition. We have 3
classes of SLURM users:


   1. Exploratory - no-charge access to up to 16 cores
   2. Priority - $750/quarter for access to up to 192 cores (and with a
   GrpTRESRunMins=cpu limit). Each user has their own QoS
   3. Condo - an investigator group that paid for nodes added to the
   cluster. The group has its own QoS and SLURM Account. The QoS allows use of
   the number of cores purchased and has a much higher priority than the QoSes
   of the "priority" users.

The first problem with this scheme is that condo users who have purchased
the older hardware now have access to the newest without penalty. In
addition, we're encountering resistance to the idea of turning off their
hardware and terminating their condos (despite MOUs stating a 5yr life).
The pushback is the stated belief that the hardware should run until it
dies.

What I propose is a new TRES called a Processor Performance Unit (PPU) that
would be specified on the Node line in slurm.conf, and used such that usage
counted against GrpTRES=ppu=N would be calculated as the number of allocated
cores multiplied by their associated PPU numbers.
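
To make the accounting concrete, here is a rough Python sketch of how PPU usage might be summed and compared against a limit. This is only an illustration of the arithmetic, not Slurm code: the per-core weights, the architecture labels, and the function names are all made up for the example, and no "ppu" TRES exists in Slurm today.

```python
# Hypothetical PPU-per-core weights; the proposal does not fix these values.
PPU_PER_CORE = {
    "sandy_ivy": 1.0,
    "haswell_broadwell": 1.5,
    "sky_cascade": 2.0,
}


def ppu_in_use(allocations):
    """Sum allocated cores weighted by the PPU/core of the node they run on.

    allocations is an iterable of (architecture, allocated_cores) pairs.
    """
    return sum(PPU_PER_CORE[arch] * cores for arch, cores in allocations)


def fits_under_limit(allocations, grp_tres_ppu):
    """True if the weighted usage stays within a GrpTRES=ppu=N style limit."""
    return ppu_in_use(allocations) <= grp_tres_ppu


# Example: 32 cores on the oldest nodes plus 16 cores on the newest.
running = [("sandy_ivy", 32), ("sky_cascade", 16)]
print(ppu_in_use(running))            # 32*1.0 + 16*2.0
print(fits_under_limit(running, 64))  # checked against a limit of 64 PPU
```

The point is that the same 48 allocated cores count for more or less depending on which hardware they land on.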

We could then assign a base PPU to the oldest hardware, say, "1" for
Sandy/Ivy and increase for later architectures based on performance
improvement. We'd set the condo QoS to GrpTRES=ppu=N*X+M*Y+..., where N is
the number of cores of the oldest architecture, X is the configured PPU/core
for that architecture, and M*Y and any further terms repeat this for newer
nodes/cores the investigator has purchased since.

The result is that the investigator group gets to run on an approximation
of the performance that they've purchased, rather than on the raw purchased
core count.
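
As a worked example of the limit calculation, the following Python sketch derives a condo's PPU limit from its purchases and shows how many cores of each architecture that limit would buy. The weights (1.0, 1.5, 2.0), the architecture labels, and the purchase numbers are assumptions chosen for illustration; real values would have to come from benchmarking.

```python
# Hypothetical PPU-per-core weights, with the oldest architecture as base 1.0.
PPU_PER_CORE = {
    "sandy_ivy": 1.0,
    "haswell_broadwell": 1.5,
    "sky_cascade": 2.0,
}


def condo_ppu_limit(purchases):
    """Compute N*X + M*Y + ... for (architecture, purchased_cores) pairs."""
    return sum(cores * PPU_PER_CORE[arch] for arch, cores in purchases)


def cores_available(limit, arch):
    """Whole cores of one architecture that fit under a given PPU limit."""
    return int(limit // PPU_PER_CORE[arch])


# A made-up condo that bought 64 old cores and later 48 new ones.
purchases = [("sandy_ivy", 64), ("sky_cascade", 48)]
limit = condo_ppu_limit(purchases)
print(f"GrpTRES=ppu={limit:g}")

# A group that bought only 64 old cores could use at most half as many
# of the newest cores under these assumed weights.
old_only = condo_ppu_limit([("sandy_ivy", 64)])
print(cores_available(old_only, "sandy_ivy"), cores_available(old_only, "sky_cascade"))
```

With these assumed weights, a group that only ever bought the oldest hardware keeps its full core count on that hardware but gets proportionally fewer cores when its jobs land on newer nodes.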

Thoughts?