[slurm-users] Proposal for new TRES - "Processor Performance Units"....

John Hearns hearnsj at googlemail.com
Thu Jun 20 14:32:40 UTC 2019

Paul, you refer to banking resources, which leads me to ask: are schemes
such as Gold still used with Slurm these days?
Gold was a utility where groups could top up with a virtual amount of money
which would be spent as they consume resources.
Altair also wrote a similar system for PBS, which they offered to us when I
was in Formula 1 - it was quite a good system, and at the time
we had a requirement for allocating resources to groups of users.

I guess the sophisticated fairshare mechanisms discussed in this thread
make schemes like Gold obsolete.

On Thu, 20 Jun 2019 at 15:24, Paul Edmon <pedmon at cfa.harvard.edu> wrote:

> People will specify which partition they need or, if they want several,
> they use this:
> #SBATCH -p general,shared,serial_requeue
> The scheduler then simply runs the job in whichever partition can start it
> first.  Naturally there is a risk that you will end up running in a more
> expensive partition.
> Our time limit is only applied to our public partitions; our owned
> partitions (of which we have roughly 80) have no time limit.  So if they
> run on their dedicated resources they have no penalty.  We've been working
> on getting rid of owned partitions and moving to a school/department based
> partition, where all the purchased resources for different PI's go into the
> same bucket where they compete against themselves and not the wider
> community.  We've found that this ends up working pretty well as most PI's
> only used their purchased resources sporadically.  Thus there are usually
> idle cores lying around that we backfill with our serial queues.  Since
> those are requeueable we can get immediate response to access that idle
> space.  We are also toying with a high priority partition that is open to
> people with high fairshare so that they can get immediate response as those
> with high fairshare tend to be bursty users.
> Our current halflife is set to a month and we keep 6 months of data in our
> database.  I'd actually like to get rid of the halflife and just go to a 3
> month moving window to allow people to bank their fairshare, but we haven't
> done that yet as people have been having a hard enough time understanding
> our current system.  It's not due to its complexity but more that most
> people just flat out aren't cognizant of their usage and think the resource
> is functionally infinite.
> -Paul Edmon-
> On 6/19/19 5:16 PM, Fulcomer, Samuel wrote:
> Hi Paul,
> Thanks. Your setup is interesting. I see that you have your processor
> types segregated in their own partitions (with the exception of the
> requeue partition), and that's how you get at the weighting mechanism. Do
> you have your users explicitly specify multiple partitions in the batch
> commands/scripts in order to take advantage of this, or do you use a plugin
> for it?
> It sounds like you don't impose any hard limit on simultaneous resource
> use, and allow everything to fairshare out with the help of the 7 day
> TimeLimit. We haven't been imposing any TimeLimit on our condo users, which
> would be an issue for us with your config. For our exploratory and priority
> users, we impose an effective time limit with GrpTRESRunMins=cpu (and
> gres/gpu= for the GPU usage). In addition, since we have so many priority
> users, we don't explicitly set a rawshare value for them (they all execute
> under the "default" account). We set rawshare for the condo accounts as
> cores-purchased/total-cores*1000.
> What's your fairshare decay setting (don't remember the proper name at the
> moment)?
> Regards,
> Sam
> On Wed, Jun 19, 2019 at 3:44 PM Paul Edmon <pedmon at cfa.harvard.edu> wrote:
>> We do a similar thing here at Harvard:
>> https://www.rc.fas.harvard.edu/fairshare/
>> We simply weight all the partitions based on their core type and then we
>> allocate Shares for each account based on what they have purchased.  We
>> don't use QoS at all, so we just rely purely on fairshare weighting for
>> resource usage.  It has worked pretty well for our purposes.
>> -Paul Edmon-
>> On 6/19/19 3:30 PM, Fulcomer, Samuel wrote:
>> (...and yes, the name is inspired by a certain OEM's software licensing
>> schemes...)
>> At Brown we run a ~400 node cluster containing nodes of multiple
>> architectures (Sandy/Ivy, Haswell/Broadwell, and Sky/Cascade) purchased in
>> some cases by University funds and in others by investigator funding
>> (~50:50).  They all appear in the default SLURM partition. We have 3
>> classes of SLURM users:
>>    1. Exploratory - no-charge access to up to 16 cores
>>    2. Priority - $750/quarter for access to up to 192 cores (and with a
>>    GrpTRESRunMins=cpu limit). Each user has their own QoS
>>    3. Condo - an investigator group who paid for nodes added to the
>>    cluster. The group has its own QoS and SLURM Account. The QoS allows use of
>>    the number of cores purchased and has a much higher priority than the QoSes
>>    of the "priority" users.
>> The first problem with this scheme is that condo users who have purchased
>> the older hardware now have access to the newest without penalty. In
>> addition, we're encountering resistance to the idea of turning off their
>> hardware and terminating their condos (despite MOUs stating a 5yr life).
>> The pushback is the stated belief that the hardware should run until it
>> dies.
>> What I propose is a new TRES called a Processor Performance Unit (PPU)
>> that would be specified on the Node line in slurm.conf, and used such that
>> GrpTRES=ppu=N was calculated as the number of allocated cores multiplied by
>> their associated PPU numbers.
>> We could then assign a base PPU to the oldest hardware, say, "1" for
>> Sandy/Ivy and increase for later architectures based on performance
>> improvement. We'd set the condo QoS to GrpTRES=ppu=N*X+M*Y,..., where N is
>> the number of cores of the oldest architecture and X is its configured
>> PPU/core, with a further term (M*Y, ...) for any newer nodes/cores the
>> investigator has purchased since.
>> The result is that the investigator group gets to run on an approximation
>> of the performance that they've purchased, rather than on the raw purchased core
>> count.
>> Thoughts?
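
The PPU arithmetic proposed above can be sketched in a few lines of Python.
This is only an illustration of the calculation, not anything Slurm does
today: "ppu" is a proposed TRES, and the per-architecture weights below are
made up for the example (only the base value of 1 for Sandy/Ivy comes from
the proposal). The names PPU_PER_CORE, ppu_budget, ppu_in_use and fits are
hypothetical.

```python
# Hypothetical PPU weights per core. Only the base of 1 for Sandy/Ivy is
# from the proposal; the other two values are invented for illustration.
PPU_PER_CORE = {
    "sandy_ivy": 1.0,
    "haswell_broadwell": 1.5,
    "sky_cascade": 2.0,
}


def ppu_budget(purchased):
    """GrpTRES=ppu value for a condo: N*X + M*Y + ...

    `purchased` maps architecture name -> number of cores bought.
    """
    return sum(cores * PPU_PER_CORE[arch] for arch, cores in purchased.items())


def ppu_in_use(allocated):
    """PPU consumed by running jobs: allocated cores times their PPU/core.

    `allocated` maps architecture name -> number of cores currently allocated.
    """
    return sum(cores * PPU_PER_CORE[arch] for arch, cores in allocated.items())


def fits(purchased, allocated):
    """True if the current allocation stays within the purchased PPU budget."""
    return ppu_in_use(allocated) <= ppu_budget(purchased)


if __name__ == "__main__":
    # A condo that bought 64 Sandy/Ivy cores and later 32 Haswell/Broadwell cores.
    condo = {"sandy_ivy": 64, "haswell_broadwell": 32}
    print("budget:", ppu_budget(condo))
    # The same group running entirely on the newest nodes.
    print("56 new cores fit:", fits(condo, {"sky_cascade": 56}))
    print("57 new cores fit:", fits(condo, {"sky_cascade": 57}))
```

With these assumed weights, a group that bought 96 older cores can use fewer
cores on the newest nodes before hitting its limit, which is the penalty the
proposal is after.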
