<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>

</head>

<body dir="ltr">

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

Dear Jurgen,</div>

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

<br>

</div>

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

Thank you for your reply. In response to your suggestion, I submitted a batch of jobs, each asking for 2 cpus. Again, I was able to get 32 jobs running at once (64 cpus in total). I presume this is a weird interaction with the normal QOS. In that respect, would it be best to redefine the normal QOS simply in terms of cpus/user? That is, a cpus/user limit only, rather than both cpus/user and nodes/user.</div>
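
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

<br>

For instance, something like the following might do it. This is an untested sketch: it assumes the existing limits live in MaxTRESPerUser on the normal QOS, and that setting a TRES value to -1 clears that limit.

<pre>
# Assumption: keep the existing cpus/user limit and clear only the nodes/user limit
sacctmgr modify qos normal set MaxTRESPerUser=cpu=1280,node=-1

# Check the result, in the same format as the serial QOS listing
sacctmgr show qos normal format=name,maxtresperuser
</pre>

If that works as I expect, the pending reason QOSMaxNodePerUserLimit should no longer appear for the 1-cpu and 2-cpu jobs.</div>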

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

<br>

</div>

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

Best regards,</div>

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

David</div>

<div id="appendonsend"></div>

<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">

<br>

</div>

<hr tabindex="-1" style="display:inline-block; width:98%">

<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" color="#000000" style="font-size:11pt"><b>From:</b> slurm-users &lt;slurm-users-bounces@lists.schedmd.com&gt; on behalf of Juergen Salk &lt;juergen.salk@uni-ulm.de&gt;<br>

<b>Sent:</b> 25 September 2019 14:52<br>

<b>To:</b> Slurm User Community List &lt;slurm-users@lists.schedmd.com&gt;<br>

<b>Subject:</b> Re: [slurm-users] Advice on setting a partition QOS</font>

<div> </div>

</div>

<div class="BodyFragment"><font size="2"><span style="font-size:11pt">

<div class="PlainText">Dear David,<br>

<br>

as it seems, Slurm counts allocated nodes on a per job basis, <br>

i.e. every individual one-core job counts as an additional node<br>

even if they all run on one and the same node. <br>

<br>

Can you allocate 64 CPUs at the same time when requesting 2 CPUs<br>

per job?<br>

<br>

We've also had this (somewhat strange) behaviour with Moab and<br>

therefore implemented limits based on processor counts rather <br>

than node counts per user. This is obviously no issue for exclusive <br>

node scheduling, but for non-exclusive nodes it is (or at least may<br>

be). <br>

<br>

Best regards<br>

Jürgen<br>

<br>

-- <br>

Jürgen Salk<br>

Scientific Software & Compute Services (SSCS)<br>

Kommunikations- und Informationszentrum (kiz)<br>

Universität Ulm<br>

Telefon: +49 (0)731 50-22478<br>

Telefax: +49 (0)731 50-22471<br>

<br>

<br>

<br>

<br>

* David Baker &lt;D.J.Baker@soton.ac.uk&gt; [190925 12:12]:<br>

> Hello,<br>

> <br>

> I have defined a partition and corresponding QOS in Slurm. This is<br>

> the serial queue to which we route jobs that require up to (and<br>

> including) 20 cpus. The nodes controlled by serial are shared. I've<br>

> set the QOS like so..<br>

> <br>

> [djb1@cyan53 slurm]$ sacctmgr show qos serial format=name,maxtresperuser<br>

>       Name     MaxTRESPU<br>

> ---------- -------------<br>

>     serial       cpu=120<br>

> <br>

> The max cpus/user is set high to try to ensure (as often as<br>

> possible) that the nodes are all busy and not in mixed states.<br>

> Obviously this cannot be the case all the time -- depending upon<br>

> memory requirements, etc.<br>

> <br>

> I noticed that a number of jobs were pending with the reason<br>

> QOSMaxNodePerUserLimit. I've tried firing test jobs to the queue<br>

> myself and noticed that I can never have more than 32 jobs running<br>

> (each requesting 1 cpu) and the rest are pending as per the reason<br>

> above. Since the QOS cpu/user limit is set to 120 I would expect to<br>

> be able to run more jobs -- given that some serial nodes are still<br>

> not fully occupied. Furthermore, I note that other users appear not<br>

> to be able to use more than 32 cpus in the queue.<br>

> <br>

> The 32 limit does make a degree of sense. The "normal" QOS is set to<br>

> cpus/user=1280, nodes/user=32. It's almost like the 32 cpus in the<br>

> serial queue are being counted as nodes -- as per the pending<br>

> reason.<br>

> <br>

> Could someone please help me understand this issue and how to avoid it?<br>

> <br>

> Best regards,<br>

> David<br>

<br>

<br>

</div>

</span></font></div>

</body>

</html>