Hi Paddy,

Why don't you add new QoS's, set them as the partition QoS for each
partition, and then set the limits on those partition QoS?

Like:

sacctmgr add qos cloud

PartitionName=cloud Nodes=node[1-6] Default=YES MaxTime=30-0 DefaultTime=0:10:0 State=DOWN QoS=cloud

That way you could have different QoS names for all the partitions across
all of your clusters, and set the limits on the QoS?

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia

On Sat, 20 Jun 2020 at 07:24, Paddy Doyle <paddy@tchpc.tcd.ie> wrote:

Hi all,

I've been trying to understand how to properly set a limit on the number of
cores a user (or an association; either is fine) can have in use at any one
time.

Ideally, I'd like to be able to set a default value once for the cluster,
and then have it inherit down to lots of associations and users. And there
are multiple clusters that need such a limit.

Our setup has a single shared slurmdbd, with multiple clusters connected
back to it (I think that's important for QOS-based solutions).

Most of the previous mails about this on the list (I know it's come up many
times before) talk about QOS-based solutions, but the problem is that QOS
limits are global across all clusters, so we can't use them like that.

I've tried lots of different sacctmgr options on a test cluster, and can't
seem to get it right. Any help would be really appreciated!


I'll go through what I've tried:


MaxJobs: this is not right, as it limits the number of jobs, not the number
of cores. So a user can still have lots of high-core-count jobs running.
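
(For reference, that sort of cap would look something like the following;
the user name and the number are just examples:)

sacctmgr modify user where name=jbloggs cluster=C1 set MaxJobs=10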


sacctmgr update qos normal set maxtresperuser=cpu=32

That will work... except that QOS is global across all of the
slurmdbd-connected clusters. So unless every cluster is the same size and
needs the same policies, it won't work in practice.
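
(The only QOS-based workaround I can see would be separate per-cluster QOS
names -- a rough sketch below, with the "c1-" prefix purely made up -- but
that means creating a QOS per cluster and attaching it as the default or
partition QOS there:)

sacctmgr add qos c1-normal
sacctmgr modify qos c1-normal set maxtresperuser=cpu=32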


sacctmgr update account cluster=C1 set MaxTRES=cpu=32 where account=A1

That limit is per-job, not per-user.


sacctmgr update account cluster=C1 set GrpTRES=cpu=32

That caps usage at 32 cores across the entire cluster, so that's not right
either.


sacctmgr update account cluster=C1 set GrpTRES=cpu=32 where account=A1

That will work alright for *that* account.

But the idea of having to do this for many tens of accounts doesn't leave me
too happy. We would have to make it part of the new-account workflow, and
any future policy change would have to be applied individually to every
existing account.
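
(If we do go down this road, I suppose it could at least be scripted; a
rough sketch, assuming "sacctmgr -nP list account" gives one account name
per line, and skipping the root account so the whole cluster doesn't get
capped:)

for acct in $(sacctmgr -nP list account format=account); do
    [ "$acct" = "root" ] && continue
    sacctmgr -i modify account where name=$acct cluster=C1 set GrpTRES=cpu=32
done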


Is there some other way that I've missed?

Thanks!

Paddy

--
Paddy Doyle
Research IT / Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
Phone: +353-1-896-3725
https://www.tchpc.tcd.ie/