[slurm-users] [External] Proposal for new TRES - "Processor Performance Units"....
Prentice Bisbal
pbisbal at pppl.gov
Fri Jun 21 20:17:55 UTC 2019
In this case, I would run LINPACK on each generation of node (either the
full node or just one core) and then normalize the results. I would
recommend using the performance of a single core of the slowest node as
the basis for normalization, so that it has a multiplier of 1 and the
newer systems have multipliers greater than 1. You can then multiply
that per-core multiplier by the number of cores in each of your
different systems to get a final multiplier for a whole node, if needed.
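
A minimal sketch of that normalization in Python. The generation names
and per-core GFLOPS figures are placeholders chosen for illustration,
not real LINPACK measurements, and the function names are hypothetical:

```python
def core_multipliers(per_core_gflops):
    """Normalize per-core scores so the slowest generation gets 1.0."""
    baseline = min(per_core_gflops.values())
    return {gen: score / baseline for gen, score in per_core_gflops.items()}


def node_multiplier(core_multiplier, cores_per_node):
    """Scale a per-core multiplier up to a whole node."""
    return core_multiplier * cores_per_node


# Placeholder single-core LINPACK results (GFLOPS), NOT measured values.
per_core_gflops = {"sandy": 10.0, "haswell": 15.0, "skylake": 25.0}

multipliers = core_multipliers(per_core_gflops)
# Hypothetical 24-core Haswell node.
haswell_node = node_multiplier(multipliers["haswell"], 24)
```

With these made-up inputs the slowest generation ends up at 1.0 and
every other generation is expressed relative to it.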
Prentice
On 6/19/19 3:30 PM, Fulcomer, Samuel wrote:
>
> (...and yes, the name is inspired by a certain OEM's software
> licensing schemes...)
>
> At Brown we run a ~400 node cluster containing nodes of multiple
> architectures (Sandy/Ivy, Haswell/Broadwell, and Sky/Cascade)
> purchased in some cases by University funds and in others by
> investigator funding (~50:50). They all appear in the default SLURM
> partition. We have 3 classes of SLURM users:
>
> 1. Exploratory - no-charge access to up to 16 cores
> 2. Priority - $750/quarter for access to up to 192 cores (and with a
> GrpTRESRunMins=cpu limit). Each user has their own QoS.
> 3. Condo - an investigator group who paid for nodes added to the
> cluster. The group has its own QoS and SLURM Account. The QoS
> allows use of the number of cores purchased and has a much higher
> priority than the QoSes of the "priority" users.
>
> The first problem with this scheme is that condo users who have
> purchased the older hardware now have access to the newest without
> penalty. In addition, we're encountering resistance to the idea of
> turning off their hardware and terminating their condos (despite MOUs
> stating a 5yr life). The pushback is the stated belief that the
> hardware should run until it dies.
>
> What I propose is a new TRES called a Processor Performance Unit (PPU)
> that would be specified on the Node line in slurm.conf, and used such
> that GrpTRES=ppu=N was calculated as the number of allocated cores
> multiplied by their associated PPU numbers.
>
> We could then assign a base PPU to the oldest hardware, say, "1" for
> Sandy/Ivy, and increase it for later architectures based on
> performance improvement. We'd set the condo QoS to
> GrpTRES=ppu=N*X+M*Y,..., where N is the number of cores of the oldest
> architecture and X is its configured PPU/core, with a further term
> (M*Y, and so on) for any newer nodes/cores the investigator has
> purchased since.
>
> The result is that the investigator group gets to run on an
> approximation of the performance that they've purchased, rather than
> on the raw purchased core count.
>
> Thoughts?
>
>
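
The GrpTRES=ppu=N*X+M*Y arithmetic proposed above can be sketched in
Python as follows. Note that "ppu" is only the proposed TRES and does
not exist in Slurm; the core counts, PPU/core values, and function
names here are hypothetical and only illustrate the calculation:

```python
import math


def condo_ppu_limit(purchased_cores, ppu_per_core):
    """Sum cores * PPU/core over every architecture the group bought."""
    return sum(cores * ppu_per_core[arch]
               for arch, cores in purchased_cores.items())


def max_cores_on(arch, ppu_limit, ppu_per_core):
    """Whole cores of one architecture that fit under the PPU limit."""
    return math.floor(ppu_limit / ppu_per_core[arch])


# Hypothetical PPU/core values, with the oldest architecture as 1.0.
ppu_per_core = {"sandy": 1.0, "skylake": 2.5}
# Hypothetical condo: 64 Sandy Bridge cores, then 48 Skylake cores.
purchased_cores = {"sandy": 64, "skylake": 48}

limit = condo_ppu_limit(purchased_cores, ppu_per_core)
skylake_cap = max_cores_on("skylake", limit, ppu_per_core)
sandy_cap = max_cores_on("sandy", limit, ppu_per_core)
```

Under these assumed values the group bought 112 cores, but running
entirely on the newest hardware it would be capped well below that,
which is the penalty the proposal is after.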