[slurm-users] Proposal for new TRES - "Processor Performance Units"....

Fulcomer, Samuel samuel_fulcomer at brown.edu
Wed Jun 19 21:03:38 UTC 2019


Hi Alex,

Thanks. The issue is that we don't know where their jobs will end up running
in the heterogeneous environment, since all of the nodes sit in the same
partition. In addition, because the limit is applied as GrpTRES=cpu=N,
someone buying 100 cores today shouldn't get a limit of 130, which would let
them occupy 130 of today's cores whenever their jobs land on the newest nodes.

Regards,
Sam

On Wed, Jun 19, 2019 at 3:41 PM Alex Chekholko <alex at calicolabs.com> wrote:

> Hey Samuel,
>
> Can't you just adjust the existing "cpu" limit numbers using those same
> multipliers?  Someone bought 100 CPUs 5 years ago, now that's ~70 CPUs.
>
> Or vice versa, someone buys 100 CPUs today, they get a setting of 130 CPUs
> because the CPUs are normalized to the old performance.  It would probably
> look bad politically to reduce someone's existing number, but giving a new
> customer a larger number should be fine.
>
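> Something like the following, roughly (QoS names are made up, and the
> numbers just assume the ~0.7x / ~1.3x per-core scaling above):
>
>   # old purchase, scaled down to current-generation cores
>   sacctmgr modify qos where name=smith_condo set GrpTRES=cpu=70
>   # new purchase, scaled up to the old baseline
>   sacctmgr modify qos where name=jones_condo set GrpTRES=cpu=130
>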
> Regards,
> Alex
>
> On Wed, Jun 19, 2019 at 12:32 PM Fulcomer, Samuel <
> samuel_fulcomer at brown.edu> wrote:
>
>>
>> (...and yes, the name is inspired by a certain OEM's software licensing
>> schemes...)
>>
>> At Brown we run a ~400 node cluster containing nodes of multiple
>> architectures (Sandy/Ivy, Haswell/Broadwell, and Sky/Cascade) purchased in
>> some cases by University funds and in others by investigator funding
>> (~50:50).  They all appear in the default SLURM partition. We have 3
>> classes of SLURM users:
>>
>>
>>    1. Exploratory - no-charge access to up to 16 cores.
>>    2. Priority - $750/quarter for access to up to 192 cores (and with a
>>    GrpTRESRunMins=cpu limit). Each user has their own QoS.
>>    3. Condo - an investigator group that paid for nodes added to the
>>    cluster. The group has its own QoS and SLURM Account. The QoS allows
>>    use of the number of cores purchased and has a much higher priority
>>    than the QoS' of the "priority" users. (A rough sacctmgr sketch of
>>    these QoS' follows below.)
>>
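>> Roughly, the QoS setup looks something like the following (illustrative
>> sacctmgr commands; the user/group names, priority values, run-minutes
>> number, and condo core count are all made up):
>>
>>    # "priority" user: per-user QoS capped at 192 cores, plus a
>>    # GrpTRESRunMins=cpu limit
>>    sacctmgr add qos jsmith_priority
>>    sacctmgr modify qos where name=jsmith_priority set \
>>        GrpTRES=cpu=192 GrpTRESRunMins=cpu=500000 Priority=10
>>
>>    # condo group: QoS sized to the cores purchased, with much higher
>>    # priority than any "priority" QoS
>>    sacctmgr add qos doe_condo
>>    sacctmgr modify qos where name=doe_condo set \
>>        GrpTRES=cpu=256 Priority=100
>>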
>> The first problem with this scheme is that condo users who have purchased
>> the older hardware now have access to the newest without penalty. In
>> addition, we're encountering resistance to the idea of turning off their
>> hardware and terminating their condos (despite MOUs stating a 5yr life).
>> The pushback is the stated belief that the hardware should run until it
>> dies.
>>
>> What I propose is a new TRES called a Processor Performance Unit (PPU).
>> A per-core PPU value would be specified on the Node line in slurm.conf, and
>> usage counted against a GrpTRES=ppu=N limit would be calculated as the
>> number of allocated cores multiplied by the PPU value of the nodes they
>> were allocated on.
>>
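>> Hypothetically, the slurm.conf side might look something like this (the
>> PPU= keyword does not exist today -- this is just the proposed shape --
>> and the node names, CPU counts, and PPU values are only illustrative):
>>
>>    NodeName=sandy[001-100]   CPUs=16 PPU=1.0    # Sandy/Ivy baseline
>>    NodeName=haswell[001-100] CPUs=24 PPU=1.15   # Haswell/Broadwell
>>    NodeName=sky[001-100]     CPUs=32 PPU=1.3    # Sky/Cascade
>>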
>> We could then assign a base PPU to the oldest hardware, say, "1" for
>> Sandy/Ivy, and increase it for later architectures based on their
>> performance improvement. We'd set the condo QoS to GrpTRES=ppu=N*X+M*Y,...,
>> where N is the number of cores of the oldest architecture the group
>> purchased, X is that architecture's configured PPU/core, and additional
>> terms (M*Y, ...) are added for any newer nodes/cores the investigator has
>> purchased since.
>>
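>> For example (core counts made up, and assuming the proposed "ppu" TRES
>> existed): a group that bought 128 Sandy/Ivy cores (PPU 1.0) and later added
>> 64 Sky/Cascade cores (PPU 1.3) would get a limit of 128*1.0 + 64*1.3 =
>> 211.2, set with something like:
>>
>>    sacctmgr modify qos where name=doe_condo set GrpTRES=ppu=211
>>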
>> The result is that the investigator group gets to run on an approximation
>> of the performance that they've purchased, rather than on the raw purchased
>> core count.
>>
>> Thoughts?
>>
>>
>>