<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Title" content="">
<meta name="Keywords" content="">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:purple;
text-decoration:underline;}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
{mso-style-priority:34;
margin-top:0in;
margin-right:0in;
margin-bottom:0in;
margin-left:.5in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
span.EmailStyle17
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:windowtext;}
span.msoIns
{mso-style-type:export-only;
mso-style-name:"";
text-decoration:underline;
color:teal;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
/* List Definitions */
@list l0
{mso-list-id:797794077;
mso-list-type:hybrid;
mso-list-template-ids:-1194675598 67698703 67698713 67698715 67698703 67698713 67698715 67698703 67698713 67698715;}
@list l0:level1
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level2
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level3
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l0:level4
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level5
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level6
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l0:level7
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level8
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level9
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
ol
{margin-bottom:0in;}
ul
{margin-bottom:0in;}
--></style>
</head>
<body bgcolor="white" lang="EN-US" link="blue" vlink="purple">
<div class="WordSection1">
<p class="MsoNormal">To emphasize what Thomas wrote: backfill is only useful if users submit jobs with realistic runtime limits. If every job is submitted with a default runtime of, for example, 7 days, then Slurm will not backfill your small jobs while
it waits for the resources for the highest-priority large job. It will only backfill a job if it can do so without delaying the start of the highest-priority job:<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<ol style="margin-top:0in" start="1" type="1">
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l0 level1 lfo1">Slurm needs resources to run Job A. It looks at the currently running jobs; they all have a runtime of &lt; 7 days.
<o:p></o:p></li><li class="MsoListParagraph" style="margin-left:0in;mso-list:l0 level1 lfo1">Slurm looks at the runtime of jobs B, C, etc. queued behind Job A. They all need 7 days.<o:p></o:p></li><li class="MsoListParagraph" style="margin-left:0in;mso-list:l0 level1 lfo1">Slurm figures that starting any of jobs B, C, etc. will push back the start of Job A to at least 7 days from now; if it just waits for the current jobs to finish, Job A will start in &lt; 7 days.
So it never backfills jobs B, C, etc.<o:p></o:p></li></ol>
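<p class="MsoNormal">The three steps above can be sketched as a toy model. This is only an illustration under simplifying assumptions, not Slurm's actual backfill algorithm: CPUs are the only resource, every job is assumed to run to its full time limit, and the names Job, earliest_start and backfill_candidates are made up for this example.<o:p></o:p></p>
<pre>
```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cpus: int          # CPUs requested (hypothetical single-resource model)
    time_limit: float  # wall-time limit in days


def earliest_start(needed_cpus, free_cpus, running):
    """Return the time (in days) at which needed_cpus are free, assuming
    every running job runs until its time limit expires.

    running is a list of (days_remaining, cpus) tuples.
    """
    available = free_cpus
    if available >= needed_cpus:
        return 0.0
    # Release CPUs in order of each running job's latest possible end.
    for days_remaining, cpus in sorted(running):
        available += cpus
        if available >= needed_cpus:
            return days_remaining
    raise ValueError("job can never fit on this cluster")


def backfill_candidates(top_job, queued, free_cpus, running):
    """Return names of queued jobs that may start now without delaying top_job.

    Simplification: a job is backfilled only if it fits in the idle CPUs
    and its time limit ends before top_job is expected to start.
    """
    expected_start = earliest_start(top_job.cpus, free_cpus, running)
    started = []
    for job in queued:
        fits_now = free_cpus >= job.cpus
        ends_in_time = expected_start >= job.time_limit
        if fits_now and ends_in_time:
            started.append(job.name)
            free_cpus -= job.cpus
    return started
```
</pre>
<p class="MsoNormal">For example, with 20 idle CPUs, running jobs that release 40 CPUs after at most 2 and 5 days, and a 100-CPU Job A, this model expects Job A to start in 5 days. Queued 10-CPU jobs with 7-day limits are never backfilled, while the same jobs submitted with 1-day limits both start immediately.<o:p></o:p></p>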
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Training users to submit jobs with realistic runtime limits is a User Education Opportunity.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">John<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b><span style="font-size:12.0pt;color:black">From: </span></b><span style="font-size:12.0pt;color:black">slurm-users &lt;slurm-users-bounces@lists.schedmd.com&gt; on behalf of "Thomas M. Payerle" &lt;payerle@umd.edu&gt;<br>
<b>Reply-To: </b>Slurm User Community List &lt;slurm-users@lists.schedmd.com&gt;<br>
<b>Date: </b>Tuesday, July 9, 2019 at 10:23 AM<br>
<b>To: </b>Slurm User Community List &lt;slurm-users@lists.schedmd.com&gt;<br>
<b>Subject: </b>Re: [slurm-users] Jobs waiting while plenty of cpu and memory available<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<div>
<p class="MsoNormal">You can use squeue to see the priority of jobs. I believe it normally shows jobs in order of priority, even though it does not display the priority itself. If you want to see the actual priority, you need to request it in the format field. I typically
use<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">squeue -o "%.18i %.12a %.6P %.8u %.2t %.8m %.4D %.4C %12l %12p %Q %b %R" &lt;any other squeue options&gt;<o:p></o:p></p>
</div>
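<div>
<p class="MsoNormal">To answer the "which jobs are ahead of mine" question, the same priority field can be post-processed. The following is a rough sketch, not a tested site tool: the helper names are hypothetical, it assumes squeue is on the PATH, and it only compares the integer priority (%Q), so a lower-priority job may still start earlier through backfill.<o:p></o:p></p>
<pre>
```python
import subprocess


def squeue_command(partition):
    """Build a squeue call that lists pending jobs as 'jobid priority reason'."""
    return ["squeue", "--noheader", "--states=PENDING",
            "--partition=" + partition, "--format=%i %Q %r"]


def jobs_ahead(squeue_text, job_id):
    """Return (jobid, priority, reason) rows whose priority exceeds job_id's."""
    rows = []
    for line in squeue_text.splitlines():
        parts = line.split(None, 2)
        if len(parts) == 3:
            rows.append((parts[0], int(parts[1]), parts[2]))
    mine = [prio for jid, prio, _ in rows if jid == job_id]
    if not mine:
        return []  # job_id is not pending in this listing
    ahead = [row for row in rows if row[1] > mine[0]]
    # Highest priority first, i.e. the job Slurm is reserving resources for.
    return sorted(ahead, key=lambda row: row[1], reverse=True)


def query_jobs_ahead(partition, job_id):
    """Run squeue and report the pending jobs ahead of job_id."""
    result = subprocess.run(squeue_command(partition), check=True,
                            capture_output=True, text=True)
    return jobs_ahead(result.stdout, job_id)
```
</pre>
<p class="MsoNormal">Calling query_jobs_ahead("batch", "12345") (the job id here is a placeholder) would list the pending jobs with a higher priority; running "scontrol show job" on the first one shows what it requested.<o:p></o:p></p>
</div>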
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Do you have backfill enabled? This can help in many cases.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">If the job with the highest priority is quite wide, Slurm will reserve resources for it. E.g., if it requests all of your nodes, then Slurm will reserve all nodes as they become idle for the wide job, until no other jobs are running and it
can finally run. Without backfill, no other jobs will run before it. With backfill, Slurm will estimate when all the nodes needed for the highest-priority job will be available (based on the walltime limits of running jobs), and will allow other jobs to
run on the reserved nodes (backfill) as long as they will complete (based on their walltime limits) before Slurm expects the remaining nodes for the top-priority job to be available. This can greatly improve utilization of the cluster; I suspect a large
percentage of our jobs run as backfill.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div>
<p class="MsoNormal">On Tue, Jul 9, 2019 at 10:10 AM Edward Ned Harvey (slurm) &lt;<a href="mailto:slurm@nedharvey.com">slurm@nedharvey.com</a>&gt; wrote:<o:p></o:p></p>
</div>
<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-right:0in">
<p class="MsoNormal" style="margin-bottom:12.0pt">> From: slurm-users &lt;<a href="mailto:slurm-users-bounces@lists.schedmd.com" target="_blank">slurm-users-bounces@lists.schedmd.com</a>&gt; On Behalf Of<br>
> Ole Holm Nielsen<br>
> Sent: Tuesday, July 9, 2019 2:36 AM<br>
> <br>
> When some jobs are pending with Reason=Priority this means that other<br>
> jobs with a higher priority are waiting for the same resources (CPUs) to<br>
> become available, and they will have Reason=Resources in the squeue<br>
> output.<br>
<br>
Yeah, that's exactly the problem. There are plenty of CPU and memory resources available, yet jobs are waiting. Is there any way to know what resources, specifically, the jobs are waiting for, or what jobs are ahead of a particular job in the queue, so I can then
look at what resources the first job requires? "scontrol show partition" doesn't reveal any clear problems:<br>
<br>
PartitionName=batch<br>
AllowGroups=ALL AllowAccounts=ALL DenyQos=foo,bar,baz<br>
AllocNodes=ALL Default=YES QoS=N/A<br>
DefaultTime=00:15:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO<br>
MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED<br>
Nodes=alpha[003-068],omega[003-068]<br>
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO<br>
OverTimeLimit=NONE PreemptMode=REQUEUE<br>
State=UP TotalCPUs=4321 TotalNodes=123 SelectTypeParameters=NONE<br>
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED<br>
<br>
The QoS policies are not new and have not changed recently, yet jobs pending like this is a new problem. I can't seem to get any information about why they're pending.<br>
<br>
<o:p></o:p></p>
</blockquote>
</div>
<p class="MsoNormal"><br clear="all">
<br>
-- <o:p></o:p></p>
<div>
<div>
<div>
<div>
<div>
<div>
<p class="MsoNormal">Tom Payerle <br>
DIT-ACIGS/Mid-Atlantic Crossroads <a href="mailto:payerle@umd.edu" target="_blank">
payerle@umd.edu</a><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">5825 University Research Park (301) 405-6135<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">University of Maryland<br>
College Park, MD 20740-3831<o:p></o:p></p>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</body>
</html>