[slurm-users] [EXT] Set a per-cluster default limit of the number of active cores per user at a time

Paddy Doyle paddy at tchpc.tcd.ie
Mon Jun 22 17:17:45 UTC 2020


Hi Sean,

That sounds like a workable solution, thanks for the suggestion!

I was hoping there was something else I'd missed in the docs that lets
you do it directly via sacctmgr, without having to edit slurm.conf as
well.
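
For the archives: the slurm.conf side of Sean's suggestion is just
attaching a per-cluster QoS to each partition, e.g. (names here are
hypothetical):

  PartitionName=compute Nodes=node[1-32] Default=YES QoS=c1-compute

Since QoS names are global to the shared slurmdbd, a per-cluster naming
scheme (c1-compute, c2-compute, ...) lets each cluster carry its own
limits.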

Thanks again,
Paddy

On Sat, Jun 20, 2020 at 09:20:02AM +1000, Sean Crosby wrote:

> Hi Paddy,
> 
> Why don't you add new QoSes, add them as the partition QoS for each
> partition, and then set the defaults on those partition QoSes?
> 
> Like
> 
> sacctmgr add qos cloud
> 
> PartitionName=cloud Nodes=node[1-6] Default=YES MaxTime=30-0
> DefaultTime=0:10:0 State=DOWN  QoS=cloud
> 
> That way you could have different QoS names for all the partitions across
> all of your clusters, and set the limits on the QoS?
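> 
> e.g., to cap each user at 32 cores on that cluster (the value is just
> an example):
> 
> sacctmgr modify qos cloud set MaxTRESPerUser=cpu=32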
> 
> Sean
> 
> --
> Sean Crosby | Senior DevOps/HPC Engineer and HPC Team Lead
> Research Computing Services | Business Services
> The University of Melbourne, Victoria 3010 Australia
> 
> 
> 
> On Sat, 20 Jun 2020 at 07:24, Paddy Doyle <paddy at tchpc.tcd.ie> wrote:
> 
> > Hi all,
> >
> > I've been trying to understand how to properly set a limit on the number
> > of cores a user (or an association; either is fine) can have in use at
> > any one time.
> >
> > Ideally, I'd like to be able to set a default value once for the cluster,
> > and then have it inherit down to lots of associations and users. And there
> > are multiple clusters that need such a limit.
> >
> > Our setup has a single shared Slurmdbd, with multiple clusters connected
> > back to it (I think that's important for QOS-based solutions).
> >
> > Most of the previous mails about this on the list (I know it's come up many
> > times before) talk about QOS-based solutions, but the problem is that the
> > QOS limits are global across all clusters, and so we can't use them like
> > that.
> >
> > I've tried lots of different sacctmgr options on a test cluster, and can't
> > seem to get it right. Any help would be really appreciated!
> >
> >
> > I'll go through what I've tried:
> >
> >
> > MaxJobs: this is not right, as it limits the number of running jobs, not
> > the number of cores. So a user can still have lots of high-core-count jobs.
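> >
> > For reference, that was along the lines of the following (user name and
> > limit value are placeholders):
> >
> >   sacctmgr update user where name=u1 set MaxJobs=10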
> >
> >
> >   sacctmgr update qos normal set maxtresperuser=cpu=32
> >
> > That will work... except that QoS is global across all of the
> > slurmdbd-connected clusters. So unless every cluster is the same size
> > and needs the same policy, it won't work in practice.
> >
> >
> >   sacctmgr update account cluster=C1 set MaxTRES=cpu=32 where account=A1
> >
> > That limit is per-job, not per user.
> >
> >
> >   sacctmgr update account cluster=C1 set GrpTRES=cpu=32
> >
> > That limits the entire cluster (all accounts combined) to a max of 32
> > cores in use, so that's not right.
> >
> >
> >   sacctmgr update account cluster=C1 set GrpTRES=cpu=32 where account=A1
> >
> > That will work alright for *that* account.
> >
> > But the idea of having to do this for many tens of accounts doesn't leave
> > me too happy. We would have to make it part of the new-account workflow,
> > and any future policy change would have to be applied individually to all
> > existing accounts.
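> >
> > The best scripted workaround I can think of is a loop over the accounts
> > (cluster name and limit are placeholders, and this is untested):
> >
> >   for acct in $(sacctmgr -nP list account format=Account); do
> >     sacctmgr -i update account cluster=C1 set GrpTRES=cpu=32 where account=$acct
> >   done
> >
> > but that would still need re-running for new accounts and for any policy
> > change.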
> >
> >
> > Is there some other way that I've missed?
> >
> > Thanks!
> >
> > Paddy
> >
> > --
> > Paddy Doyle
> > Research IT / Trinity Centre for High Performance Computing,
> > Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
> > Phone: +353-1-896-3725
> > https://www.tchpc.tcd.ie/
> >
> >

-- 
Paddy Doyle
Research IT / Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
Phone: +353-1-896-3725
https://www.tchpc.tcd.ie/


