[slurm-users] [External] Proposal for new TRES - "Processor Performance Units"....

Prentice Bisbal pbisbal at pppl.gov
Fri Jun 21 20:17:55 UTC 2019


In this case, I would run LINPACK on each generation of node (either the 
full node or just one core), and then normalize the results. I would 
recommend using the performance of a single core of the slowest node as 
your basis for normalization, so it has a multiplier of 1 and the newer 
systems have multipliers greater than 1. You can then multiply that 
per-core multiplier by the number of cores in your different systems to 
get a final multiplier for a whole node, if needed.
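
As a rough illustration (the GFLOPS and core counts below are made up; 
plug in your own LINPACK results):

    # Hypothetical per-core LINPACK results in GFLOPS; substitute measured values.
    linpack_gflops_per_core = {"sandy": 18.0, "haswell": 30.0, "skylake": 42.0}
    cores_per_node = {"sandy": 16, "haswell": 24, "skylake": 32}

    basis = min(linpack_gflops_per_core.values())   # slowest core defines multiplier 1.0

    for arch, gflops in sorted(linpack_gflops_per_core.items()):
        per_core = gflops / basis                    # per-core multiplier
        per_node = per_core * cores_per_node[arch]   # whole-node multiplier, if needed
        print(f"{arch:8s}  per-core {per_core:.2f}   per-node {per_node:.1f}")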

Prentice

On 6/19/19 3:30 PM, Fulcomer, Samuel wrote:
>
> (...and yes, the name is inspired by a certain OEM's software 
> licensing schemes...)
>
> At Brown we run a ~400-node cluster containing nodes of multiple 
> architectures (Sandy/Ivy, Haswell/Broadwell, and Sky/Cascade), 
> purchased in some cases with University funds and in others with 
> investigator funding (~50:50). They all appear in the default SLURM 
> partition. We have 3 classes of SLURM users:
>
>  1. Exploratory - no-charge access to up to 16 cores
>  2. Priority - $750/quarter for access to up to 192 cores (and with a
>     GrpTRESRunMins=cpu limit). Each user has their own QoS
>  3. Condo - an investigator group that paid for nodes added to the
>     cluster. The group has its own QoS and SLURM account. The QoS
>     allows use of the number of cores purchased and has a much higher
>     priority than the QoSes of the "priority" users.
>
> The first problem with this scheme is that condo users who have 
> purchased the older hardware now have access to the newest without 
> penalty. In addition, we're encountering resistance to the idea of 
> turning off their hardware and terminating their condos (despite MOUs 
> stating a 5yr life). The pushback is the stated belief that the 
> hardware should run until it dies.
>
> What I propose is a new TRES called a Processor Performance Unit (PPU) 
> that would be specified on the Node line in slurm.conf. Usage against a 
> GrpTRES=ppu=N limit would then be calculated as the number of allocated 
> cores multiplied by their associated PPU values.
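>
> In Python terms, the charging arithmetic would be something like this 
> (the PPU values are placeholders, since no such TRES exists today):
>
>     # cores allocated per architecture, and hypothetical PPU/core from slurm.conf
>     allocated = {"sandy": 32, "skylake": 16}
>     ppu_per_core = {"sandy": 1.0, "skylake": 1.8}
>
>     ppu_usage = sum(cores * ppu_per_core[arch] for arch, cores in allocated.items())
>     print(ppu_usage)   # 32*1.0 + 16*1.8 = 60.8, counted against GrpTRES=ppu=N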
>
> We could then assign a base PPU to the oldest hardware, say "1" for 
> Sandy/Ivy, and increase it for later architectures based on performance 
> improvement. We'd set the condo QoS to GrpTRES=ppu=N*X+M*Y+..., where N 
> is the number of cores of the oldest architecture, X is its configured 
> PPU per core, and further terms are added for any newer nodes/cores the 
> investigator has purchased since.
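>
> For example (again with made-up PPU values), a group that bought 128 
> Sandy/Ivy cores at PPU 1.0 and 64 Skylake cores at PPU 1.8 would get 
> GrpTRES=ppu=128*1.0+64*1.8 = 243.2, i.e. roughly 243.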
>
> The result is that the investigator group gets to run on an 
> approximation of the performance they've purchased, rather than on the 
> raw purchased core count.
>
> Thoughts?
>
>