<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:#0563C1;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:#954F72;
text-decoration:underline;}
span.EmailStyle17
{mso-style-type:personal-compose;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-family:"Calibri",sans-serif;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style>
</head>
<body lang="EN-US" link="#0563C1" vlink="#954F72">
<div class="WordSection1">
<p class="MsoNormal">I’m perplexed. My cluster has been churning along, but tonight it started leaving jobs pending even though plenty of nodes are available.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Example output from squeue:<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> 409978 interacti verdi amirinen PD 0:00 1 (Resources)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> 409989 regress update_r jenkins PD 0:00 1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> 409985 regress update_r amirinen PD 0:00 1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> 409982 regress update_r akshabal PD 0:00 1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> 409994 regress SYN__tpb kumarbck PD 0:00 1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> 409999 interacti sbatch_w akshabal PD 0:00 1 (Priority)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> 410000 regress ICC2__tp gadikon PD 0:00 1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> 410005 regress update_r amirinen PD 0:00 1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> 410003 regress update_r bachchuk PD 0:00 1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> 410006 regress update_r saurahuj PD 0:00 1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> 410009 regress xterm_fi gadikon PD 0:00 1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> 410010 regress ICC2__tp gadikon PD 0:00 1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> 410001 regress ICC2__tp gadikon PD 0:00 1 (Dependency)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> 410002 regress ICC2__tp gadikon PD 0:00 1 (Dependency)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> 410004 regress ICC2__tp gadikon PD 0:00 1 (Dependency)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> 410011 regress ICC2__tp gadikon PD 0:00 1 (Dependency)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> 410014 regress ICC2__tp gadikon PD 0:00 1 (Dependency)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> 410015 regress ICC2__tp gadikon PD 0:00 1 (Dependency)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> 409937 interacti verdi nsamra R 5:51:10 1 c7-c5n-18xl-3<o:p></o:p></span></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">The output of sinfo shows plenty of nodes available for the scheduler.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><span style="font-family:"Courier New"">PARTITION AVAIL TIMELIMIT NODES STATE NODELIST<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New"">all up infinite 31954 idle~ al2-t3-2xl-[0-999],al2-t3-l-[0-999],c7-c5-24xl-[0-5,7-10,14,16-17,19,21-46,48-151,153-155,157-164,167,169-485,487-999],c7-c5d-24xl-[0,2-999],c7-c5n-18xl-[0-2,4-14,16-26,28-44,46,48-51,53-54,56-63,65-67,69-72,74-75,77-82,84,86-99,101-999],c7-m5-24xl-[0-325,327-999],c7-m5d-24xl-[0-191,193-999],c7-m5dn-24xl-[0-3,5-97,99-999],c7-m5n-24xl-[0-24,26-999],c7-r5d-16xl-[0-3,5-999],c7-r5d-24xl-[1-16,18-999],c7-r5dn-24xl-[0-1,3-999],c7-t3-2xl-[0-8,10-970,973-999],c7-t3-l-[0-999],c7-x1-32xl-[0-6,8-999],c7-x1e-32xl-[0-999],c7-z1d-12xl-[0,2-5,7,9-10,12-999],rh7-c5-24xl-[0-999],rh7-c5d-24xl-[0-999],rh7-c5n-18xl-[0-999],rh7-m5-24xl-[0-999],rh7-m5d-24xl-[0-999],rh7-m5dn-24xl-[0-999],rh7-m5n-24xl-[0-999],rh7-r5d-16xl-[0-999],rh7-r5d-24xl-[0-999],rh7-r5dn-24xl-[0-999],rh7-t3-2xl-[0-999],rh7-t3-l-[0-999],rh7-x1-32xl-[0-999],rh7-x1e-32xl-[0-999],rh7-z1d-12xl-[0-999]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New"">all up infinite 2 drain c7-t3-l-s-0,rh7-t3-l-s-0<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New"">all up infinite 46 mix c7-c5-24xl-[6,11-13,15,18,20,47,152,156,165-166,168,486],c7-c5d-24xl-1,c7-c5n-18xl-[3,15,27,45,47,52,55,64,68,73,76,83,85,100],c7-m5-24xl-326,c7-m5d-24xl-192,c7-m5dn-24xl-[4,98],c7-m5n-24xl-25,c7-r5d-16xl-4,c7-r5d-24xl-[0,17],c7-r5dn-24xl-2,c7-t3-2xl-[9,971-972],c7-x1-32xl-7,c7-z1d-12xl-[1,6,8,11]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New"">all up infinite 1 idle al2-t3-l-s-0<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New"">The job isn’t requesting anything special: just 1 core and 1G of memory.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New"">Any thoughts on why the scheduler would just stop scheduling jobs? This cluster runs on AWS, and my intention is to provide enough nodes that jobs never queue. That had been working until now.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New"">I’ve tried restarting slurmctld with an increased logging level, but that has not helped.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New"">I see the following messages in slurmctld.log:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New"">[2020-03-29T21:55:58.951] debug: sched: Running job scheduler<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New"">[2020-03-29T21:55:58.953] debug: sched: JobId=409999. State=PENDING. Reason=Priority, Priority=100013. Partition=interactive.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New"">[2020-03-29T21:56:58.932] debug: sched: Running job scheduler<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New"">[2020-03-29T21:56:58.934] debug: sched: JobId=409999. State=PENDING. Reason=Priority, Priority=100013. Partition=interactive.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New"">The output of scontrol for this job is:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New"">JobId=409999 JobName=sbatch_wrap.sh<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> UserId=akshabal(67674) GroupId=domain_users(66049) MCS_label=N/A<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> Priority=100013 Nice=0 Account=(null) QOS=normal<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> JobState=PENDING Reason=Priority Dependency=(null)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> SubmitTime=2020-03-29T19:34:09 EligibleTime=2020-03-29T19:34:09<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> AccrueTime=2020-03-29T19:34:09<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> StartTime=2020-03-29T21:51:27 EndTime=Unknown Deadline=N/A<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-03-29T21:50:58<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> Partition=interactive AllocNode:Sid=a-2vaol6a8g9ca8.mla.annapurna.aws.a2z.com:7549<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> ReqNodeList=(null) ExcNodeList=(null)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> NodeList=(null)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> NumNodes=1 NumCPUs=8 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:*<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> TRES=cpu=8,mem=32G,node=1,billing=8<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> MinCPUsNode=8 MinMemoryNode=32G MinTmpDiskNode=0<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> Features=(null) DelayBoot=00:00:00<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> Command=/tools/slurm/bin/sbatch_wrap.sh jg source.tcl<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> WorkDir=/proj/trench_work4/akshabal/wa_fixes_array_sequencer/verif/fv/sunda_tpb/tpb_state_buf<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> StdErr=/proj/trench_work4/akshabal/wa_fixes_array_sequencer/verif/fv/sunda_tpb/tpb_state_buf/slurm-409999.out<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> StdIn=/dev/null<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> StdOut=/proj/trench_work4/akshabal/wa_fixes_array_sequencer/verif/fv/sunda_tpb/tpb_state_buf/slurm-409999.out<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""> Power=<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New"">How do I go about debugging this?<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Courier New""><o:p> </o:p></span></p>
</div>
</body>
</html>