[slurm-users] Proposal for new TRES - "Processor Performance Units"....
pedmon at cfa.harvard.edu
Thu Jun 20 14:42:32 UTC 2019
I don't know off hand. You can sort of construct a similar system in
Slurm, but I've never seen it as a native option.
On 6/20/19 10:32 AM, John Hearns wrote:
> Paul, you refer to banking resources. Which leads me to ask are
> schemes such as Gold used these days in Slurm?
> Gold was a utility where groups could top up with a virtual amount of
> money which would be spent as they consume resources.
> Altair also wrote a similar system for PBS, which they offered to us
> when I was in Formula 1 - it was quite a good system, and at the time
> we had a requirement for allocating resources to groups of users.
> I guess the sophisticated fairshare mechanisms discussed in this
> thread make schemes like Gold obsolete.
> On Thu, 20 Jun 2019 at 15:24, Paul Edmon <pedmon at cfa.harvard.edu>
> wrote:
> People will specify which partition they need or if they want
> multiple they use this:
> #SBATCH -p general,shared,serial_requeue
> The scheduler then starts the job in whichever of the listed
> partitions can run it first. Naturally there is a risk that you
> will end up running in a more expensive partition.
> Our time limit is only applied to our public partitions, our owned
> partitions (of which we have roughly 80) have no time limit. So
> if they run on their dedicated resources they have no penalty.
> We've been working on getting rid of owned partitions and moving
> to a school/department based partition, where all the purchased
> resources for different PI's go into the same bucket where they
> compete against themselves and not the wider community. We've
> found that this ends up working pretty well as most PI's only used
> their purchased resources sporadically. Thus there are usually
> idle cores lying around that we backfill with our serial queues.
> Since those are requeueable we can get immediate response to
> access that idle space. We are also toying with a high priority
> partition that is open to people with high fairshare so that they
> can get immediate response as those with high fairshare tend to be
> bursty users.
> Our current halflife is set to a month and we keep 6 months of
> data in our database. I'd actually like to get rid of the
> halflife and just go to a 3 month moving window to allow people to
> bank their fairshare, but we haven't done that yet as people have
> been having a hard enough time understanding our current system.
> It's not due to its complexity but more that most people just flat
> out aren't cognizant of their usage and think the resource is
> functionally infinite.
> -Paul Edmon-
> On 6/19/19 5:16 PM, Fulcomer, Samuel wrote:
>> Hi Paul,
>> Thanks. Your setup is interesting. I see that you have your
>> processor types segregated in their own partitions (with the
>> exception of the requeue partition), and that's how you get at
>> the weighting mechanism. Do you have your users explicitly
>> specify multiple partitions in the batch commands/scripts in
>> order to take advantage of this, or do you use a plugin for it?
>> It sounds like you don't impose any hard limit on simultaneous
>> resource use, and allow everything to fairshare out with the help
>> of the 7 day TimeLimit. We haven't been imposing any TimeLimit on
>> our condo users, which would be an issue for us with your config.
>> For our exploratory and priority users, we impose an effective
>> time limit with GrpTRESRunMins=cpu (and gres/gpu= for the GPU
>> usage). In addition, since we have so many priority users, we
>> don't explicitly set a rawshare value for them (they all execute
>> under the "default" account). We set rawshare for the condo
>> accounts as cores-purchased/total-cores*1000.
>> What's your fairshare decay setting (don't remember the proper
>> name at the moment)?
>> On Wed, Jun 19, 2019 at 3:44 PM Paul Edmon
>> <pedmon at cfa.harvard.edu> wrote:
>> We do a similar thing here at Harvard:
>> We simply weight all the partitions based on their core type
>> and then we allocate Shares for each account based on what
>> they have purchased. We don't use QoS at all, so we just rely
>> purely on fairshare weighting for resource usage. It has
>> worked pretty well for our purposes.
>> -Paul Edmon-
>> On 6/19/19 3:30 PM, Fulcomer, Samuel wrote:
>>> (...and yes, the name is inspired by a certain OEM's
>>> software licensing schemes...)
>>> At Brown we run a ~400 node cluster containing nodes of
>>> multiple architectures (Sandy/Ivy, Haswell/Broadwell, and
>>> Sky/Cascade) purchased in some cases by University funds and
>>> in others by investigator funding (~50:50). They all appear
>>> in the default SLURM partition. We have 3 classes of SLURM
>>> 1. Exploratory - no-charge access to up to 16 cores
>>> 2. Priority - $750/quarter for access to up to 192 cores
>>> (and with a GrpTRESRunMins=cpu limit). Each user has
>>> their own QoS
>>> 3. Condo - an investigator group who paid for nodes added
>>> to the cluster. The group has its own QoS and SLURM
>>> Account. The QoS allows use of the number of cores
>>> purchased and has a much higher priority than the QoS'
>>> of the "priority" users.
>>> The first problem with this scheme is that condo users who
>>> have purchased the older hardware now have access to the
>>> newest without penalty. In addition, we're encountering
>>> resistance to the idea of turning off their hardware and
>>> terminating their condos (despite MOUs stating a 5yr life).
>>> The pushback is the stated belief that the hardware should
>>> run until it dies.
>>> What I propose is a new TRES called a Processor Performance
>>> Unit (PPU) that would be specified on the Node line in
>>> slurm.conf, and used such that GrpTRES=ppu=N was calculated
>>> as the number of allocated cores multiplied by their
>>> associated PPU numbers.
>>> We could then assign a base PPU to the oldest hardware, say,
>>> "1" for Sandy/Ivy and increase for later architectures based
>>> on performance improvement. We'd set the condo QoS to
>>> GrpTRES=ppu=N*X+M*Y,..., where N is the number of cores of
>>> the oldest architecture and X is its configured PPU/core,
>>> with a further term (M*Y, ...) for any newer nodes/cores the
>>> investigator has purchased since.
>>> The result is that the investigator group gets to run on an
>>> approximation of the performance that they've purchased,
>>> rather than on the raw purchased core count.
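
For anyone trying to follow the arithmetic in the PPU proposal quoted above, here is a rough sketch in Python. It is only an illustration: "ppu" is the TRES being proposed in this thread, not something Slurm provides today, and apart from the base value of 1 for Sandy/Ivy the per-core weights, the architecture keys and the function names are all made-up assumptions.

```python
# Sketch of the proposed PPU accounting. Nothing here is a Slurm API.
# Only the base weight of 1 for Sandy/Ivy comes from the proposal; the
# other weights are hypothetical placeholders.
PPU_PER_CORE = {
    "sandy_ivy": 1.0,
    "haswell_broadwell": 1.5,  # assumed value
    "sky_cascade": 2.0,        # assumed value
}


def ppu_total(cores_by_arch):
    """Sum cores * PPU/core over architectures (the N*X + M*Y + ... form)."""
    return sum(cores * PPU_PER_CORE[arch] for arch, cores in cores_by_arch.items())


def would_fit(limit, running, request):
    """True if the request keeps the group's PPU usage at or under its limit."""
    return ppu_total(running) + ppu_total(request) <= limit


# A condo that bought 64 Sandy/Ivy cores and later 32 Sky/Cascade cores
# would get a limit of 64*1.0 + 32*2.0 = 128 PPU.
purchased = {"sandy_ivy": 64, "sky_cascade": 32}
limit = ppu_total(purchased)

# Running only on the newest nodes, the group reaches that limit at 64
# cores rather than the 96 cores it purchased.
print(limit, ppu_total({"sky_cascade": 64}))
```

With these assumed weights, a group already holding 48 Sky/Cascade cores could start a 16-core job on the same hardware but not a 17-core one, which is the "approximation of the performance that they've purchased" behaviour described in the proposal.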