[slurm-users] Proposal for new TRES - "Processor Performance Units"....

Paul Edmon pedmon at cfa.harvard.edu
Thu Jun 20 14:20:04 UTC 2019

People specify the partition they need, or, if they are willing to run 
in any of several, they use this:

#SBATCH -p general,shared,serial_requeue

The scheduler then runs the job in whichever of the listed partitions 
can start it first.  Naturally there is a risk that you will end up 
running in a more expensive partition.

Our time limit applies only to our public partitions; our owned 
partitions (of which we have roughly 80) have no time limit.  So if they 
run on their dedicated resources they have no penalty.  We've been 
working on getting rid of owned partitions and moving to a 
school/department based partition, where all the purchased resources for 
different PIs go into the same bucket, where they compete among 
themselves and not against the wider community.  We've found that this 
ends up working pretty well as most PIs only use their purchased 
resources sporadically.  Thus there are usually idle cores lying around 
that we 
backfill with our serial queues.  Since those jobs are requeueable, the 
owners can still get immediate access to that idle space.  We are also 
toying with a 
high priority partition that is open to people with high fairshare so 
that they can get immediate response as those with high fairshare tend 
to be bursty users.

Our current halflife is set to a month and we keep 6 months of data in 
our database.  I'd actually like to get rid of the halflife and just go 
to a 3 month moving window to allow people to bank their fairshare, but 
we haven't done that yet as people have been having a hard enough time 
understanding our current system. It's not due to its complexity but 
more that most people just flat out aren't cognizant of their usage and 
think the resource is functionally infinite.
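
To make the halflife versus moving window difference concrete, here is a 
rough sketch in Python.  It is not how Slurm computes usage internally; 
the usage records and function names are made up for illustration, and 
it only assumes that a halflife halves the weight of old usage every 
period while a window counts recent usage in full and drops the rest.

```python
def decayed_usage(records, half_life_days):
    """Sum usage, halving each record's weight every half_life_days."""
    return sum(cpu_hours * 0.5 ** (age_days / half_life_days)
               for age_days, cpu_hours in records)


def windowed_usage(records, window_days):
    """Sum usage at full weight, ignoring anything older than window_days."""
    return sum(cpu_hours for age_days, cpu_hours in records
               if age_days <= window_days)


# Hypothetical usage history: (age in days, CPU-hours used).
records = [(0, 1000.0), (30, 1000.0), (60, 1000.0), (120, 1000.0)]

print(decayed_usage(records, half_life_days=30))
print(windowed_usage(records, window_days=90))
```

With a halflife, usage from four months ago still counts a little 
against the account; with a window, it disappears entirely once it ages 
out, which is what would let people bank their fairshare.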

-Paul Edmon-

On 6/19/19 5:16 PM, Fulcomer, Samuel wrote:
> Hi Paul,
> Thanks. Your setup is interesting. I see that you have your processor 
> types segregated in their own partitions (with the exception of the 
> requeue partition), and that's how you get at the weighting mechanism. 
> Do you have your users explicitly specify multiple partitions in the 
> batch commands/scripts in order to take advantage of this, or do you 
> use a plugin for it?
> It sounds like you don't impose any hard limit on simultaneous 
> resource use, and allow everything to fairshare out with the help of 
> the 7 day TimeLimit. We haven't been imposing any TimeLimit on our 
> condo users, which would be an issue for us with your config. For our 
> exploratory and priority users, we impose an effective time limit with 
> GrpTRESRunMins=cpu (and gres/gpu= for the GPU usage). In addition, 
> since we have so many priority users, we don't explicitly set a 
> rawshare value for them (they all execute under the "default" 
> account). We set rawshare for the condo accounts as 
> cores-purchased/total-cores*1000.
> What's your fairshare decay setting (don't remember the proper name at 
> the moment)?
> Regards,
> Sam
> On Wed, Jun 19, 2019 at 3:44 PM Paul Edmon <pedmon at cfa.harvard.edu 
> <mailto:pedmon at cfa.harvard.edu>> wrote:
>     We do a similar thing here at Harvard:
>     https://www.rc.fas.harvard.edu/fairshare/
>     We simply weight all the partitions based on their core type and
>     then we allocate Shares for each account based on what they have
>     purchased.  We don't use QoS at all, so we just rely purely on
>     fairshare weighting for resource usage.  It has worked pretty well
>     for our purposes.
>     -Paul Edmon-
>     On 6/19/19 3:30 PM, Fulcomer, Samuel wrote:
>>     (...and yes, the name is inspired by a certain OEM's software
>>     licensing schemes...)
>>     At Brown we run a ~400 node cluster containing nodes of multiple
>>     architectures (Sandy/Ivy, Haswell/Broadwell, and Sky/Cascade)
>>     purchased in some cases by University funds and in others by
>>     investigator funding (~50:50).  They all appear in the default
>>     SLURM partition. We have 3 classes of SLURM users:
>>      1. Exploratory - no-charge access to up to 16 cores
>>      2. Priority - $750/quarter for access to up to 192 cores (and
>>         with a GrpTRESRunMins=cpu limit). Each user has their own QoS
>>      3. Condo - an investigator group who paid for nodes added to the
>>         cluster. The group has its own QoS and SLURM Account. The QoS
>>         allows use of the number of cores purchased and has a much
>>         higher priority than the QoS' of the "priority" users.
>>     The first problem with this scheme is that condo users who have
>>     purchased the older hardware now have access to the newest
>>     without penalty. In addition, we're encountering resistance to
>>     the idea of turning off their hardware and terminating their
>>     condos (despite MOUs stating a 5yr life). The pushback is the
>>     stated belief that the hardware should run until it dies.
>>     What I propose is a new TRES called a Processor Performance Unit
>>     (PPU) that would be specified on the Node line in slurm.conf, and
>>     used such that GrpTRES=ppu=N was calculated as the number of
>>     allocated cores multiplied by their associated PPU numbers.
>>     We could then assign a base PPU to the oldest hardware, say, "1"
>>     for Sandy/Ivy and increase for later architectures based on
>>     performance improvement. We'd set the condo QoS to
>>     GrpTRES=ppu=N*X+M*Y,..., where N is the number of cores of the
>>     oldest architecture and X is its configured PPU/core, with a
>>     further term (M*Y, ...) for any newer nodes/cores the
>>     investigator has purchased since.
>>     The result is that the investigator group gets to run on an
>>     approximation of the performance that they've purchased, rather
>>     than on the raw purchased core count.
>>     Thoughts?
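
The PPU arithmetic in the quoted proposal can be sketched roughly as 
follows.  Note that "ppu" is a proposed TRES, not something Slurm 
provides, so this is plain Python rather than configuration.  Only the 
Sandy/Ivy baseline of 1 comes from the proposal; the other weights, the 
architecture keys, and the function names are assumptions for 
illustration.

```python
# Hypothetical PPU-per-core weights. Only the Sandy/Ivy baseline of 1 comes
# from the proposal; the other two values are invented for illustration.
PPU_PER_CORE = {
    "sandy_ivy": 1.0,
    "haswell_broadwell": 1.5,
    "sky_cascade": 2.0,
}


def ppu_total(cores_by_arch):
    """Sum cores * PPU/core over architectures (the N*X + M*Y + ... term)."""
    return sum(cores * PPU_PER_CORE[arch]
               for arch, cores in cores_by_arch.items())


def within_limit(allocated, purchased):
    """True if the PPU of the allocated cores fits the purchased PPU budget."""
    return ppu_total(allocated) <= ppu_total(purchased)


# A condo that bought 64 Sandy/Ivy cores and later 32 Sky/Cascade cores.
purchased = {"sandy_ivy": 64, "sky_cascade": 32}

print(ppu_total(purchased))                             # 64*1.0 + 32*2.0
print(within_limit({"sky_cascade": 64}, purchased))     # uses the whole budget
print(within_limit({"sky_cascade": 65}, purchased))     # one core too many
```

Under these made-up weights the group's limit is 128 PPU, so it could 
run on 64 of the newest cores rather than the 96 cores it purchased, 
which is the penalty for using newer hardware that the proposal is 
after.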