<meta http-equiv="Content-Type" content="text/html; charset=utf-8"><div dir="ltr"><div>Hi Paddy,</div><div><br></div><div>Why don't you add a new QoS for each partition, attach it as that partition's QoS, and then set the default limits on those partition QoS?</div><div><br></div><div>For example:</div><div><br></div><div>sacctmgr add qos cloud</div><div><br></div><div>PartitionName=cloud Nodes=node[1-6] Default=YES MaxTime=30-0 DefaultTime=0:10:0 State=DOWN  QoS=cloud <br></div><div><br></div><div>That way you could give the partitions on each of your clusters their own QoS names, and set the limits on those QoS.</div><div><br></div><div>Sean</div><div><br></div><div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature">--<br>Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead<br>Research Computing Services | Business Services<br>The University of Melbourne, Victoria 3010 Australia<br><br></div></div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, 20 Jun 2020 at 07:24, Paddy Doyle &lt;<a href="mailto:paddy@tchpc.tcd.ie">paddy@tchpc.tcd.ie</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

Hi all,<br>

<br>

I've been trying to understand how to properly set a limit on the number of<br>

cores a user (or an association; either is fine) can have in use at any one<br>

time.<br>

<br>

Ideally, I'd like to be able to set a default value once for the cluster,<br>

and then have it inherit down to lots of associations and users. And there<br>

are multiple clusters that need such a limit.<br>

<br>

Our setup has a single shared Slurmdbd, with multiple clusters connected<br>

back to it (I think that's important for QOS-based solutions).<br>

<br>

Most of the previous mails about this on the list (I know it's come up many<br>

times before) talk about QOS-based solutions, but the problem is that the<br>

QOS limits are global across all clusters, and so we can't use them like<br>

that.<br>

<br>

I've tried lots of different sacctmgr options on a test cluster, and can't<br>

seem to get it right. Any help would be really appreciated!<br>

<br>

<br>

I'll go through what I've tried:<br>

<br>

<br>

MaxJobs: this is not right, as it limits the jobs, not the number of cores.<br>

So a user can have lots of high-core-count jobs.<br>

<br>

<br>

  sacctmgr update qos normal set maxtresperuser=cpu=32<br>

<br>

That will work, except that QOS is global across all of the<br>

slurmdbd-connected clusters. So unless every cluster is of the same size<br>

and the policies need to be the same, it won't work in practice.<br>

<br>

<br>

  sacctmgr update account cluster=C1 set MaxTRES=cpu=32 where account=A1<br>

<br>

That limit is per-job, not per user.<br>

<br>

<br>

  sacctmgr update account cluster=C1  set GrpTRES=cpu=32<br>

<br>

That sets a maximum of 32 cores in use across the entire cluster, so that's not<br>

right.<br>

<br>

<br>

  sacctmgr update account cluster=C1  set GrpTRES=cpu=32 where account=A1<br>

<br>

That will work alright for *that* account.<br>

<br>

But the idea of having to do this for many tens of accounts doesn't leave me<br>

too happy. And we would have to make it part of a new account workflow. And<br>

any future policy changes would have to be reset individually for all<br>

existing accounts.<br>

<br>

<br>

Is there some other way that I've missed?<br>

<br>

Thanks!<br>

<br>

Paddy<br>

<br>

-- <br>

Paddy Doyle<br>

Research IT / Trinity Centre for High Performance Computing,<br>

Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.<br>

Phone: +353-1-896-3725<br>

<a href="https://www.tchpc.tcd.ie/" rel="noreferrer" target="_blank">https://www.tchpc.tcd.ie/</a><br>

<br>

</blockquote></div>
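<div><br></div>
<div>A minimal sketch of the partition-QoS approach suggested at the top of this message, assuming two clusters (C1 from the examples above, plus a hypothetical C2) that each get their own QoS. The QoS names c1_cloud and c2_cloud and the 128-core value are placeholders, not anything from the original setup. The run wrapper is also just an illustration: it only prints the commands unless APPLY=1 is set, so the script can be tried on a machine without Slurm.</div>

```shell
#!/bin/sh
# Dry run by default: print each command instead of executing it.
# Set APPLY=1 to actually run them (needs sacctmgr and admin rights).
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

# One QoS per cluster partition; the names and the 128 value are hypothetical.
run sacctmgr -i add qos c1_cloud
run sacctmgr -i modify qos c1_cloud set MaxTRESPerUser=cpu=32
run sacctmgr -i add qos c2_cloud
run sacctmgr -i modify qos c2_cloud set MaxTRESPerUser=cpu=128

# Then, in each cluster's slurm.conf, attach the QoS to the partition, e.g. on C1:
#   PartitionName=cloud Nodes=node[1-6] Default=YES QoS=c1_cloud
```

<div>Because each cluster's slurm.conf references a different QoS name, the per-user core limit can differ per cluster even though the QoS definitions live in the one shared slurmdbd. As far as I know, the limits are only enforced if AccountingStorageEnforce includes limits, so that is worth checking first.</div>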