<html>
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252">
</head>
<body>
Ivan,<br>
<br>
Are you seeing I/O slowness? That is the most common cause for us.
If it's not that, you'll want to look through the reasons a process
can take a long time to actually die after a SIGKILL, because one of
those is the likely cause. Typically the process is waiting for an
I/O syscall to return. Sometimes swap death is the culprit, but
usually not at the scale that you stated.
Maybe you could try reproducing the issue manually, or putting
something in the epilog to see the state of the processes in the
job's cgroup.<br>
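<br>
As a rough starting point, here is a minimal sketch of such an epilog check, written in Python. It assumes a Linux node with /proc mounted and that the job's processes are listed in cgroup.procs files somewhere under a cgroup directory that you pass in. The directory path, the function names (parse_stat, cgroup_pids, report) and the output format are all my own placeholders, not anything Slurm provides. A process shown in state D is in uninterruptible sleep, which is what a process stuck in an I/O syscall usually looks like, and wchan hints at where in the kernel it is waiting.<br>
<br>
```python
import os
import sys


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/PID/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    pid = int(stat_text[:left])
    comm = stat_text[left + 1:right]
    state = stat_text[right + 1:].split()[0]
    return pid, comm, state


def read_text(path):
    """Read a small text file, returning '' if it cannot be read."""
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        return ""


def cgroup_pids(cgroup_dir):
    """Collect PIDs from every cgroup.procs file under cgroup_dir."""
    pids = set()
    for root, _dirs, files in os.walk(cgroup_dir):
        if "cgroup.procs" in files:
            text = read_text(os.path.join(root, "cgroup.procs"))
            pids.update(int(tok) for tok in text.split())
    return sorted(pids)


def report(cgroup_dir, out=sys.stdout):
    """Write one line per process: pid, state, wait channel, name."""
    for pid in cgroup_pids(cgroup_dir):
        stat_text = read_text("/proc/%d/stat" % pid)
        if not stat_text:
            continue  # process exited between the two reads
        _pid, comm, state = parse_stat(stat_text)
        wchan = read_text("/proc/%d/wchan" % pid) or "-"
        flag = "  (uninterruptible)" if state == "D" else ""
        out.write("%d %s %s %s%s\n" % (pid, state, wchan, comm, flag))


if __name__ == "__main__" and len(sys.argv) == 2:
    # Assumption: the single argument is the job's cgroup directory.
    report(sys.argv[1])
```
<br>
You would call this from the epilog with whatever directory your cgroup setup uses for the job, and send the output to a log file so you can look at it after the node drains.<br>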
<br>
Ryan<br>
<br>
<div class="moz-cite-prefix">On 7/22/20 10:24 AM, Ivan Kovanda
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:MWHPR11MB006111D9FE2985B00DFC220BF7790@MWHPR11MB0061.namprd11.prod.outlook.com">
<div class="WordSection1">
<p class="MsoNormal"><span style="color:black">Dear slurm
community,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:black"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:black">Currently running
slurm version 18.08.4<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:black"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:black">We have been
experiencing an issue that causes any node a slurm job was
submitted to to go into the "drain" state.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:black">From what I've
seen, it appears that there is a problem with how slurm
cleans up the job after sending SIGKILL.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:black"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:black">I've found this
slurm article
(<a class="moz-txt-link-freetext" href="https://slurm.schedmd.com/troubleshoot.html#completing">https://slurm.schedmd.com/troubleshoot.html#completing</a>),
which has a section titled "Jobs and nodes are stuck in
COMPLETING state", where it recommends increasing the
"UnkillableStepTimeout" in slurm.conf, but all that has
done is prolong the time it takes for the job to time out.
<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:black">The default time
for the "UnkillableStepTimeout" is 60 seconds.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:black"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:black">After the job
completes, it stays in the CG (completing) status for the 60
seconds, then the nodes the job was submitted to go to drain
status.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:black"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:black">On the headnode
running slurmctld, I am seeing this in the log -
/var/log/slurmctld:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:black">--------------------------------------------------------------------------------------------------------------------------------------------<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:black">[2020-07-21T22:40:03.000]
update_node: node node001 reason set to: Kill task failed<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:black">[2020-07-21T22:40:03.001]
update_node: node node001 state set to DRAINING<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:black"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:black">On the compute
node, I am seeing this in the log - /var/log/slurmd<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:black">--------------------------------------------------------------------------------------------------------------------------------------------<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:black">[2020-07-21T22:38:33.110]
[1485.batch] done with job<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:black">[2020-07-21T22:38:33.110]
[1485.extern] Sent signal 18 to 1485.4294967295<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:black">[2020-07-21T22:38:33.111]
[1485.extern] Sent signal 15 to 1485.4294967295<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:black">[2020-07-21T22:39:02.820]
[1485.extern] Sent SIGKILL signal to 1485.4294967295<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:black">[2020-07-21T22:40:03.000]
[1485.extern] error: *** EXTERN STEP FOR 1485 STEPD
TERMINATED ON node001 AT 2020-07-21T22:40:02 DUE TO JOB NOT
ENDING WITH SIGNALS ***<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:black"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:black"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:black">I've tried
restarting the slurmd daemon on the compute nodes, and even
completely rebooting a few compute nodes (node001, node002).
<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:black">From what I've
seen, we're experiencing this on all nodes in the cluster.
<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:black">I've yet to
restart the headnode because there are still active jobs on
the system so I don't want to interrupt those.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:black"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:black"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:black">Thank you for
your time,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:black">Ivan<o:p></o:p></span></p>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
</blockquote>
<br>
</body>
</html>