[slurm-users] Proposal for new TRES - "Processor Performance Units"....

Thu Jun 20 14:42:32 UTC 2019

I don't know offhand.  You can sort of construct a similar system in 
Slurm, but I've never seen it as a native option.

-Paul Edmon-

On 6/20/19 10:32 AM, John Hearns wrote:
> Paul, you refer to banking resources. Which leads me to ask are 
> schemes such as Gold used these days in Slurm?
> Gold was a utility where groups could top up with a virtual amount of 
> money which would be spent as they consume resources.
> Altair also wrote a similar system for PBS, which they offered to us 
> when I was in Formula 1 - it was quite a good system, and at the time
> we had a requirement for allocating resources to groups of users.
>
> I guess the sophisticated fairshare mechanisms discussed in this 
> thread make schemes like Gold obsolete.
>
> On Thu, 20 Jun 2019 at 15:24, Paul Edmon <pedmon at cfa.harvard.edu> wrote:
>
>     People will specify which partition they need or if they want
>     multiple they use this:
>
>     #SBATCH -p general,shared,serial_requeue
>
>     As then the scheduler will just select which partition they will
>     run in first.  Naturally there is a risk that you will end up
>     running in a more expensive partition.
>
>     Our time limit is only applied to our public partitions; our owned
>     partitions (of which we have roughly 80) have no time limit.  So
>     if they run on their dedicated resources they have no penalty. 
>     We've been working on getting rid of owned partitions and moving
>     to a school/department based partition, where all the purchased
>     resources for different PIs go into the same bucket where they
>     compete against themselves and not the wider community.  We've
>     found that this ends up working pretty well as most PIs only use
>     their purchased resources sporadically.  Thus there are usually
>     idle cores lying around that we backfill with our serial queues. 
>     Since those are requeueable we can get immediate response to
>     access that idle space.  We are also toying with a high priority
>     partition that is open to people with high fairshare so that they
>     can get immediate response as those with high fairshare tend to be
>     bursty users.
>
>     Our current halflife is set to a month and we keep 6 months of
>     data in our database.  I'd actually like to get rid of the
>     halflife and just go to a 3 month moving window to allow people to
>     bank their fairshare, but we haven't done that yet as people have
>     been having a hard enough time understanding our current system. 
>     It's not due to its complexity but more that most people just flat
>     out aren't cognizant of their usage and think the resource is
>     functionally infinite.
>
>     -Paul Edmon-
>
>     On 6/19/19 5:16 PM, Fulcomer, Samuel wrote:
>>     Hi Paul,
>>
>>     Thanks. Your setup is interesting. I see that you have your
>>     processor types segregated in their own partitions (with the
>>     exception of the requeue partition), and that's how you get at
>>     the weighting mechanism. Do you have your users explicitly
>>     specify multiple partitions in the batch commands/scripts in
>>     order to take advantage of this, or do you use a plugin for it?
>>
>>     It sounds like you don't impose any hard limit on simultaneous
>>     resource use, and allow everything to fairshare out with the help
>>     of the 7 day TimeLimit. We haven't been imposing any TimeLimit on
>>     our condo users, which would be an issue for us with your config.
>>     For our exploratory and priority users, we impose an effective
>>     time limit with GrpTRESRunMins=cpu (and gres/gpu= for the GPU
>>     usage). In addition, since we have so many priority users, we
>>     don't explicitly set a rawshare value for them (they all execute
>>     under the "default" account). We set rawshare for the condo
>>     accounts as cores-purchased/total-cores*1000.
>>
>>     What's your fairshare decay setting (don't remember the proper
>>     name at the moment)?
>>
>>     Regards,
>>     Sam
>>
>>
>>
>>     On Wed, Jun 19, 2019 at 3:44 PM Paul Edmon
>>     <pedmon at cfa.harvard.edu> wrote:
>>
>>         We do a similar thing here at Harvard:
>>
>>         https://www.rc.fas.harvard.edu/fairshare/
>>
>>         We simply weight all the partitions based on their core type
>>         and then we allocate Shares for each account based on what
>>         they have purchased. We don't use QoS at all, so we just rely
>>         purely on fairshare weighting for resource usage.  It has
>>         worked pretty well for our purposes.
>>
>>         -Paul Edmon-
>>
>>         On 6/19/19 3:30 PM, Fulcomer, Samuel wrote:
>>>
>>>         (...and yes, the name is inspired by a certain OEM's
>>>         software licensing schemes...)
>>>
>>>         At Brown we run a ~400 node cluster containing nodes of
>>>         multiple architectures (Sandy/Ivy, Haswell/Broadwell, and
>>>         Sky/Cascade) purchased in some cases by University funds and
>>>         in others by investigator funding (~50:50).  They all appear
>>>         in the default SLURM partition. We have 3 classes of SLURM
>>>         users:
>>>
>>>          1. Exploratory - no-charge access to up to 16 cores
>>>          2. Priority - $750/quarter for access to up to 192 cores
>>>             (and with a GrpTRESRunMins=cpu limit). Each user has
>>>             their own QoS
>>>          3. Condo - an investigator group who paid for nodes added
>>>             to the cluster. The group has its own QoS and SLURM
>>>             Account. The QoS allows use of the number of cores
>>>             purchased and has a much higher priority than the QoSes
>>>             of the "priority" users.
>>>
>>>         The first problem with this scheme is that condo users who
>>>         have purchased the older hardware now have access to the
>>>         newest without penalty. In addition, we're encountering
>>>         resistance to the idea of turning off their hardware and
>>>         terminating their condos (despite MOUs stating a 5yr life).
>>>         The pushback is the stated belief that the hardware should
>>>         run until it dies.
>>>
>>>         What I propose is a new TRES called a Processor Performance
>>>         Unit (PPU) that would be specified on the Node line in
>>>         slurm.conf, and used such that GrpTRES=ppu=N was calculated
>>>         as the number of allocated cores multiplied by their
>>>         associated PPU numbers.
>>>
>>>         We could then assign a base PPU to the oldest hardware, say,
>>>         "1" for Sandy/Ivy and increase for later architectures based
>>>         on performance improvement. We'd set the condo QoS to
>>>         GrpTRES=ppu=N*X+M*Y+..., where N is the number of cores of
>>>         the oldest architecture and X is its configured PPU/core,
>>>         with a further term (M*Y, and so on) for any newer
>>>         nodes/cores the investigator has purchased since.
>>>
>>>         The result is that the investigator group gets to run on an
>>>         approximation of the performance that they've purchased,
>>>         rather than on the raw purchased core count.
>>>
>>>         Thoughts?
>>>
>>>
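
The PPU arithmetic proposed at the start of this thread comes down to a weighted sum, which can be sketched in Python. This is only an illustration: the `ppu` TRES is a proposal in this thread, not something the sketch reads from Slurm. The thread suggests "1" for Sandy/Ivy only, so the other two weights, and all function and variable names, are assumptions made up for the example.

```python
# Hypothetical PPU weights per core. Only the Sandy/Ivy value of 1 comes
# from the thread; the other two are assumed for illustration.
PPU_PER_CORE = {
    "sandy_ivy": 1.0,
    "haswell_broadwell": 1.5,  # assumed
    "sky_cascade": 2.0,        # assumed
}


def condo_ppu_limit(purchased_cores):
    """Return N*X + M*Y + ... for the cores a group purchased.

    purchased_cores maps an architecture name to a purchased core count.
    """
    return sum(PPU_PER_CORE[arch] * n for arch, n in purchased_cores.items())


def ppu_in_use(allocated_cores):
    """Return the PPU consumed by currently allocated cores."""
    return sum(PPU_PER_CORE[arch] * n for arch, n in allocated_cores.items())


def can_start(limit, running, request):
    """True if the request fits under the group's PPU limit."""
    return ppu_in_use(running) + ppu_in_use(request) <= limit


if __name__ == "__main__":
    limit = condo_ppu_limit({"sandy_ivy": 64, "sky_cascade": 32})
    print("PPU limit:", limit)
    print("fits:", can_start(limit, {"sky_cascade": 60}, {"sky_cascade": 4}))
```

With these assumed weights, a group that bought 64 Sandy/Ivy cores and 32 Sky/Cascade cores gets a limit of 128 PPU, so it could hold at most 64 Sky/Cascade cores at once rather than its 96 raw purchased cores.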