<div dir="ltr"><div dir="ltr"><div dir="ltr"><div class="gmail_default" style="font-family:monospace">(sorry, kind of fell asleep on you there...)</div><div class="gmail_default" style="font-family:monospace"><br></div><div class="gmail_default" style="font-family:monospace">I wouldn't expect backfill to be a problem, since it shouldn't start jobs that won't complete before the priority reservations start. We allow jobs to run over their time limit (OverTimeLimit), so in our case it can be a problem.</div><div class="gmail_default" style="font-family:monospace"><br></div><div class="gmail_default" style="font-family:monospace">On one of our cloud clusters we had problems with large jobs getting starved, so we set "assoc_limit_stop" in the scheduler parameters. I think for your config it would require removing "assoc_limit_continue" (we're on Slurm 18, where _continue is the default behavior and _stop is what you set if you want the other behavior). That cluster uses the builtin scheduler, though, and I'd imagine this would play heck with a fairshare/backfill cluster like our on-campus one. Still, it is designed to prevent large-job starvation.</div><div class="gmail_default" style="font-family:monospace"><br></div><div class="gmail_default" style="font-family:monospace">We'd also had some issues with fairshare hitting the limit pretty quickly (basically it stopped being a useful factor in calculating priority), so we set FairShareDampeningFactor to 5 to get a little more utility out of it.</div><div class="gmail_default" style="font-family:monospace"><br></div><div class="gmail_default" style="font-family:monospace">I'd suggest looking at the output of sprio to see how your factors are working in situ, particularly when you've got a stuck large job. It may be that SMALL_RELATIVE_TO_TIME is washing out the job size factor if your larger jobs are also longer.</div><div class="gmail_default" style="font-family:monospace"><br></div><div class="gmail_default" style="font-family:monospace">HTH.</div><div class="gmail_default" style="font-family:monospace"><br></div><div class="gmail_default" style="font-family:monospace">M</div><div class="gmail_default" style="font-family:monospace"><br></div></div></div><br><div class="gmail_quote" style=""><div dir="ltr" class="gmail_attr">On Wed, Apr 10, 2019 at 2:46 AM David Baker &lt;<a href="mailto:D.J.Baker@soton.ac.uk">D.J.Baker@soton.ac.uk</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<span>Michael,</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<span><br>
</span></div>
<div><font color="#000000" face="Calibri, Arial, Helvetica, sans-serif"><span style="font-size:12pt">Thank you for your reply and your thoughts. These are the priority weights that I have configured in the slurm.conf. </span></font></div>
<div><font color="#000000" face="Calibri, Arial, Helvetica, sans-serif"><span style="font-size:12pt"><br>
</span></font></div>
<div><font color="#000000" face="Calibri, Arial, Helvetica, sans-serif"><span style="font-size:12pt"><span>PriorityWeightFairshare=1000000<br>
</span>
<div>PriorityWeightAge=100000<br>
</div>
<div>PriorityWeightPartition=1000</div>
<div>PriorityWeightJobSize=10000000<br>
</div>
<span>PriorityWeightQOS=10000</span><br>
</span></font></div>
<div><font color="#000000" face="Calibri, Arial, Helvetica, sans-serif"><span style="font-size:12pt"><br>
</span></font></div>
<div><font color="#000000" face="Calibri, Arial, Helvetica, sans-serif"><span style="font-size:12pt">I've made PWJobSize the highest factor; however, I understand that it only provides a one-off kick to jobs and so is probably insignificant in the longer run. That's followed by PWFairshare. </span></font></div>
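<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
To see how these weights trade off, here is a rough Python sketch of the multifactor sum, assuming each factor is normalized to the range 0.0 to 1.0 and ignoring the other terms Slurm can add (TRES, nice, site factor). The two jobs and their factor values are made up for illustration; real values would come from sprio.</div>
<pre style="font-family:monospace">
```python
# Weights copied from the slurm.conf excerpt above.
WEIGHTS = {
    "fairshare": 1000000,
    "age": 100000,
    "partition": 1000,
    "jobsize": 10000000,
    "qos": 10000,
}


def job_priority(factors, weights=WEIGHTS):
    """Weighted sum of normalized factors (each expected in 0.0..1.0)."""
    return sum(weights[name] * factors.get(name, 0.0) for name in weights)


# Two hypothetical pending jobs; the factor values are invented for illustration.
large_job = {"fairshare": 0.2, "age": 1.0, "jobsize": 0.05}
small_job = {"fairshare": 0.5, "age": 0.1, "jobsize": 0.002}

for label, factors in (("large", large_job), ("small", small_job)):
    print(label, round(job_priority(factors)))
```
</pre>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
With these weights a fully aged job gains at most 100000 from age, which is 1% of the largest possible job size contribution, so the relative sizes of the weights matter as much as the factors themselves.</div>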
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<span><br>
</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<span>Should I really be looking at increasing the PWAge factor to help to "push jobs" through the system? </span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<span><br>
</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<span>The other issue that might play a part is that we see a lot of single-node jobs (presumably backfilled) entering the system. Users aren't excessively bombing the cluster, but maybe some backfill throttling would be useful as well (?)</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<span><br>
</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<span>What are your thoughts having seen the priority factors, please? I've attached a copy of the slurm.conf just in case you or anyone else wants to take a more complete overview.</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<span><br>
</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Best regards,</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
David</div>
<div id="gmail-m_-4873990043200236936appendonsend"></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<hr style="display:inline-block;width:98%">
<div id="gmail-m_-4873990043200236936divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" color="#000000" style="font-size:11pt"><b>From:</b> slurm-users &lt;<a href="mailto:slurm-users-bounces@lists.schedmd.com" target="_blank">slurm-users-bounces@lists.schedmd.com</a>&gt; on behalf of Michael Gutteridge &lt;<a href="mailto:michael.gutteridge@gmail.com" target="_blank">michael.gutteridge@gmail.com</a>&gt;<br>
<b>Sent:</b> 09 April 2019 18:59<br>
<b>To:</b> Slurm User Community List<br>
<b>Subject:</b> Re: [slurm-users] Effect of PriorityMaxAge on job throughput</font>
<div> </div>
</div>
<div>
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div class="gmail-m_-4873990043200236936x_" style="font-family:monospace"><br>
</div>
<div class="gmail-m_-4873990043200236936x_" style="font-family:monospace">It might be useful to include the various priority factors you've got configured. The fact that adjusting PriorityMaxAge had a dramatic effect suggests that the age factor is pretty high- might be
worth looking at that value relative to the other factors.</div>
<div class="gmail-m_-4873990043200236936x_" style="font-family:monospace"><br>
</div>
<div class="gmail-m_-4873990043200236936x_" style="font-family:monospace">Have you looked at PriorityWeightJobSize? It might have some utility if you're finding large jobs getting short shrift.</div>
<div class="gmail-m_-4873990043200236936x_" style="font-family:monospace"><br>
</div>
<div class="gmail-m_-4873990043200236936x_" style="font-family:monospace"> - Michael</div>
<div class="gmail-m_-4873990043200236936x_" style="font-family:monospace"><br>
</div>
</div>
</div>
<br>
<div class="gmail-m_-4873990043200236936x_gmail_quote">
<div dir="ltr" class="gmail-m_-4873990043200236936x_gmail_attr">On Tue, Apr 9, 2019 at 2:01 AM David Baker &lt;<a href="mailto:D.J.Baker@soton.ac.uk" target="_blank">D.J.Baker@soton.ac.uk</a>&gt; wrote:<br>
</div>
<blockquote class="gmail-m_-4873990043200236936x_gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Hello,</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
I've finally got the job throughput/turnaround to be reasonable in our cluster. Most of the time the job activity on the cluster sets the default QOS to 32 nodes (there are 464 nodes in the default queue). Jobs requesting a node count close to the QOS level (for example, 22 nodes) are scheduled within 24 hours, which is better than it has been. Still, I suspect there is room for improvement. I note that these large jobs still struggle to be given a start time; however, many jobs are now being given a start time following my SchedulerParameters makeover.</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
I used advice from the mailing list and the Slurm high throughput document to help me make changes to the scheduling parameters. They are now...</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
SchedulerParameters=assoc_limit_continue,batch_sched_delay=20,bf_continue,bf_interval=300,bf_min_age_reserve=10800,bf_window=3600,bf_resolution=600,bf_yield_interval=1000000,partition_job_depth=500,sched_max_job_start=200,sched_min_interval=2000000<br>
</div>
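<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Since these options mix several units, a small Python sketch for reading the line back may help. The unit table reflects my understanding of the slurm.conf man page and should be checked against the documentation for the Slurm version in use; the helper name parse_scheduler_parameters is just illustrative.</div>
<pre style="font-family:monospace">
```python
# Units as I understand them from the slurm.conf man page; please verify
# against the documentation for your Slurm version.
UNITS = {
    "batch_sched_delay": "seconds",
    "bf_interval": "seconds",
    "bf_min_age_reserve": "seconds",
    "bf_window": "minutes",
    "bf_resolution": "seconds",
    "bf_yield_interval": "microseconds",
    "sched_min_interval": "microseconds",
}

# The SchedulerParameters value quoted above.
SCHEDULER_PARAMETERS = (
    "assoc_limit_continue,batch_sched_delay=20,bf_continue,bf_interval=300,"
    "bf_min_age_reserve=10800,bf_window=3600,bf_resolution=600,"
    "bf_yield_interval=1000000,partition_job_depth=500,"
    "sched_max_job_start=200,sched_min_interval=2000000"
)


def parse_scheduler_parameters(value):
    """Split a SchedulerParameters string into {name: int, or True for flags}."""
    parsed = {}
    for item in value.split(","):
        name, sep, number = item.partition("=")
        parsed[name] = int(number) if sep else True
    return parsed


params = parse_scheduler_parameters(SCHEDULER_PARAMETERS)
for name, value in params.items():
    unit = UNITS.get(name, "")
    print(f"{name:22} {value} {unit}".rstrip())

# bf_window is given in minutes, so 3600 is a 2.5 day look-ahead.
print("bf_window in days:", params["bf_window"] / (60 * 24))
```
</pre>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Read this way, bf_window=3600 is a 2.5 day backfill look-ahead, and bf_yield_interval and sched_min_interval are 1 and 2 seconds respectively.</div>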
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Also..</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<span>PriorityFavorSmall=NO<br>
</span>
<div>PriorityFlags=SMALL_RELATIVE_TO_TIME,ACCRUE_ALWAYS,FAIR_TREE<br>
</div>
<div>PriorityType=priority/multifactor<br>
</div>
<span>PriorityDecayHalfLife=7-0</span><br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<span>PriorityMaxAge=1-0<br>
</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<span><br>
</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<span>The most significant change was actually reducing "PriorityMaxAge" from 7-0 to 1-0. Before that change the larger jobs could hang around in the queue for days. Does it make sense therefore to further reduce PriorityMaxAge to less than 1 day? Your advice would be appreciated, please.</span></div>
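<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
One way to picture the effect of this change is the sketch below, which assumes the age factor grows linearly with the time a job has been eligible and is capped at 1.0 once it reaches PriorityMaxAge. The 12 hour wait is a hypothetical value chosen for illustration.</div>
<pre style="font-family:monospace">
```python
DAY = 24 * 60 * 60  # seconds in a day


def age_factor(seconds_eligible, priority_max_age):
    """Linear ramp from 0.0 to 1.0, capped once the job reaches PriorityMaxAge."""
    return min(1.0, seconds_eligible / priority_max_age)


waited = 12 * 60 * 60  # a hypothetical job that has been eligible for 12 hours
for label, max_age in (("7-0", 7 * DAY), ("1-0", 1 * DAY)):
    factor = age_factor(waited, max_age)
    print(f"PriorityMaxAge={label}: age factor {factor:.3f}")
```
</pre>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Under that assumption, the same 12 hour wait gives an age factor of about 0.071 with PriorityMaxAge=7-0 and 0.5 with 1-0, so waiting jobs reach their full age contribution seven times faster. Once a job hits the cap, lowering PriorityMaxAge further gives it no extra priority.</div>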
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<span><br>
</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<span>Best regards,</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<span>David</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<span><br>
</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<span><br>
</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</div>
</blockquote></div></div>