Same here. Whenever we see rashes of "Kill task failed" it is invariably symptomatic of one of our Lustre filesystems acting up or being saturated.

-Paul Edmon-
    <div class="moz-cite-prefix">On 7/22/2020 3:21 PM, Ryan Cox wrote:<br>
    </div>
    <blockquote type="cite"
      cite="mid:3e94ec2e-3291-5617-5e0e-7739a717c561@byu.edu">
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
Angelos,

I'm glad you mentioned UnkillableStepProgram. We meant to look at that a while ago but forgot about it. That will be very useful for us as well, though the answer for us is pretty much always Lustre problems.

Ryan
      <div class="moz-cite-prefix">On 7/22/20 1:02 PM, Angelos Ching
        wrote:<br>
      </div>
      <blockquote type="cite"
        cite="mid:A6E5FB6B-F87C-4840-9981-8B5C5A4FEA01@clustertech.com">
        <meta http-equiv="content-type" content="text/html;
          charset=UTF-8">
Agreed. You may also want to write a script that gathers the list of programs in "D state" (uninterruptible kernel wait) and prints their stacks, and configure it as UnkillableStepProgram, so that you can capture the program and the relevant system calls that caused the job to become unkillable / time out while exiting, for further troubleshooting.
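A minimal sketch of such a script, assuming Python is available on the compute nodes, that it runs as root (reading /proc/<pid>/stack requires root and a kernel built with CONFIG_STACKTRACE), and that /var/log/slurm/ is writable; the script name and log path are hypothetical:

#!/usr/bin/env python3
# dump_dstate.py - hypothetical UnkillableStepProgram: log every process
# in uninterruptible sleep ("D" state) along with its kernel stack.
import glob
import os
import time

LOG = "/var/log/slurm/unkillable.log"  # assumed writable path

def stat_fields(path):
    # /proc/<pid>/stat is "pid (comm) state ..."; comm may contain spaces,
    # so split around the last ")" rather than on whitespace.
    with open(path) as f:
        data = f.read()
    lparen, rparen = data.find("("), data.rfind(")")
    return data[:lparen].strip(), data[lparen + 1:rparen], data[rparen + 1:].split()[0]

with open(LOG, "a") as log:
    log.write("=== unkillable step reported at %s ===\n" % time.ctime())
    for stat in glob.glob("/proc/[0-9]*/stat"):
        try:
            pid, comm, state = stat_fields(stat)
            if state != "D":
                continue
            try:
                with open(os.path.join(os.path.dirname(stat), "stack")) as f:
                    stack = f.read()
            except OSError:
                stack = "(kernel stack unavailable)\n"
            log.write("PID %s (%s) in D state:\n%s\n" % (pid, comm, stack))
        except OSError:
            continue  # process exited while we were scanning

It would then be enabled in slurm.conf along these lines (path illustrative):

UnkillableStepProgram=/usr/local/sbin/dump_dstate.py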
Regards,
Angelos
(Sent from mobile, please pardon me for typos and cursoriness.)
          <div dir="ltr"><br>
            <blockquote type="cite">2020/07/23 0:41、Ryan Cox <a
                class="moz-txt-link-rfc2396E"
                href="mailto:ryan_cox@byu.edu" moz-do-not-send="true"><ryan_cox@byu.edu></a>のメール:<br>
              <br>
            </blockquote>
          </div>
          <blockquote type="cite">
            <div dir="ltr">
              <meta http-equiv="Content-Type" content="text/html;
                charset=UTF-8">
Ivan,

Are you having I/O slowness? That is the most common cause for us. If it's not that, you'll want to look through all the reasons it can take a long time for a process to actually die after a SIGKILL, because one of those is the likely cause. Typically it's because the process is waiting for an I/O syscall to return. Sometimes swap death is the culprit, but usually not at the scale that you stated. Maybe you could try reproducing the issue manually, or putting something in the epilog to see the state of the processes in the job's cgroup.
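For instance, a minimal epilog fragment along these lines (the cgroup v1 freezer path matches a common proctrack/cgroup layout but varies by site and Slurm version, so treat the cgroup path and log file as assumptions; SLURM_JOB_ID and SLURM_JOB_UID are set in the epilog environment):

#!/usr/bin/env python3
# epilog_cgroup_dump.py - hypothetical epilog helper: record any processes
# still present in the job's cgroup and their /proc states.
import os

uid = os.environ.get("SLURM_JOB_UID", "?")
job = os.environ.get("SLURM_JOB_ID", "?")
cgdir = "/sys/fs/cgroup/freezer/slurm/uid_%s/job_%s" % (uid, job)  # assumed layout

with open("/var/log/slurm/epilog_cgroup.log", "a") as log:  # assumed path
    for root, _dirs, files in os.walk(cgdir):
        if "cgroup.procs" not in files:
            continue
        with open(os.path.join(root, "cgroup.procs")) as f:
            pids = f.read().split()
        for pid in pids:
            try:
                with open("/proc/%s/stat" % pid) as s:
                    data = s.read()
                state = data[data.rfind(")") + 1:].split()[0]  # field after "(comm)"
                log.write("job %s: pid %s state %s under %s\n" % (job, pid, state, root))
            except OSError:
                pass  # pid exited between listing and reading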
Ryan
              <div class="moz-cite-prefix">On 7/22/20 10:24 AM, Ivan
                Kovanda wrote:<br>
              </div>
              <blockquote type="cite"
cite="mid:MWHPR11MB006111D9FE2985B00DFC220BF7790@MWHPR11MB0061.namprd11.prod.outlook.com">
                <meta http-equiv="Content-Type" content="text/html;
                  charset=UTF-8">
                <meta name="Generator" content="Microsoft Word 15
                  (filtered medium)">
                <div class="WordSection1">
                  <p class="MsoNormal"><span style="color:black">Dear
                      slurm community,<o:p></o:p></span></p>
                  <p class="MsoNormal"><span style="color:black"><o:p> </o:p></span></p>
                  <p class="MsoNormal"><span style="color:black">Currently
                      running slurm version 18.08.4<o:p></o:p></span></p>
                  <p class="MsoNormal"><span style="color:black"><o:p> </o:p></span></p>
                  <p class="MsoNormal"><span style="color:black">We have
                      been experiencing an issue causing any nodes a
                      slurm job was submitted to to "drain".<o:p></o:p></span></p>
                  <p class="MsoNormal"><span style="color:black">From
                      what I've seen, it appears that there is a problem
                      with how slurm is cleaning up the job with the
                      SIGKILL process.<o:p></o:p></span></p>
                  <p class="MsoNormal"><span style="color:black"><o:p> </o:p></span></p>
                  <p class="MsoNormal"><span style="color:black">I've
                      found this slurm article (<a
                        class="moz-txt-link-freetext"
                        href="https://slurm.schedmd.com/troubleshoot.html#completing"
                        moz-do-not-send="true">https://slurm.schedmd.com/troubleshoot.html#completing</a>)
                      , which has a section titled "Jobs and nodes are
                      stuck in COMPLETING state", where it recommends
                      increasing the "UnkillableStepTimeout" in the
                      slurm.conf , but all that has done is prolong the
                      time it takes for the job to timeout. <o:p></o:p></span></p>
                  <p class="MsoNormal"><span style="color:black">The
                      default time for the "UnkillableStepTimeout" is 60
                      seconds.<o:p></o:p></span></p>
                  <p class="MsoNormal"><span style="color:black"><o:p> </o:p></span></p>
                  <p class="MsoNormal"><span style="color:black">After
                      the job completes, it stays in the CG (completing)
                      status for the 60 seconds, then the nodes the job
                      was submitted to go to drain status.<o:p></o:p></span></p>
                  <p class="MsoNormal"><span style="color:black"><o:p> </o:p></span></p>
                  <p class="MsoNormal"><span style="color:black">On the
                      headnode running slurmctld, I am seeing this in
                      the log - /var/log/slurmctld:<o:p></o:p></span></p>
                  <p class="MsoNormal"><span style="color:black">--------------------------------------------------------------------------------------------------------------------------------------------<o:p></o:p></span></p>
                  <p class="MsoNormal"><span style="color:black">[2020-07-21T22:40:03.000]
                      update_node: node node001 reason set to: Kill task
                      failed<o:p></o:p></span></p>
                  <p class="MsoNormal"><span style="color:black">[2020-07-21T22:40:03.001]
                      update_node: node node001 state set to DRAINING<o:p></o:p></span></p>
                  <p class="MsoNormal"><span style="color:black"><o:p> </o:p></span></p>
                  <p class="MsoNormal"><span style="color:black">On the
                      compute node, I am seeing this in the log -
                      /var/log/slurmd<o:p></o:p></span></p>
                  <p class="MsoNormal"><span style="color:black">--------------------------------------------------------------------------------------------------------------------------------------------<o:p></o:p></span></p>
                  <p class="MsoNormal"><span style="color:black">[2020-07-21T22:38:33.110]
                      [1485.batch] done with job<o:p></o:p></span></p>
                  <p class="MsoNormal"><span style="color:black">[2020-07-21T22:38:33.110]
                      [1485.extern] Sent signal 18 to 1485.4294967295<o:p></o:p></span></p>
                  <p class="MsoNormal"><span style="color:black">[2020-07-21T22:38:33.111]
                      [1485.extern] Sent signal 15 to 1485.4294967295<o:p></o:p></span></p>
                  <p class="MsoNormal"><span style="color:black">[2020-07-21T22:39:02.820]
                      [1485.extern] Sent SIGKILL signal to
                      1485.4294967295<o:p></o:p></span></p>
                  <p class="MsoNormal"><span style="color:black">[2020-07-21T22:40:03.000]
                      [1485.extern] error: *** EXTERN STEP FOR 1485
                      STEPD TERMINATED ON node001 AT 2020-07-21T22:40:02
                      DUE TO JOB NOT ENDING WITH SIGNALS ***<o:p></o:p></span></p>
                  <p class="MsoNormal"><span style="color:black"><o:p> </o:p></span></p>
                  <p class="MsoNormal"><span style="color:black"><o:p> </o:p></span></p>
                  <p class="MsoNormal"><span style="color:black">I've
                      tried restarting the SLURMD daemon on the compute
                      nodes, and even completing rebooting a few
                      computes nodes (node001, node002) . <o:p></o:p></span></p>
                  <p class="MsoNormal"><span style="color:black">From
                      what I've seen were experiencing this on all nodes
                      in the cluster. <o:p></o:p></span></p>
                  <p class="MsoNormal"><span style="color:black">I've
                      yet to restart the headnode because there are
                      still active jobs on the system so I don't want to
                      interrupt those.<o:p></o:p></span></p>
                  <p class="MsoNormal"><span style="color:black"><o:p> </o:p></span></p>
                  <p class="MsoNormal"><span style="color:black"><o:p> </o:p></span></p>
                  <p class="MsoNormal"><span style="color:black">Thank
                      you for your time,<o:p></o:p></span></p>
                  <p class="MsoNormal"><span style="color:black">Ivan<o:p></o:p></span></p>
                  <p class="MsoNormal"><o:p> </o:p></p>
                </div>