<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    Angelos,<br>
    <br>
    I'm glad you mentioned UnkillableStepProgram.  We meant to look at
    that a while ago but forgot about it.  That will be very useful for
    us as well, though the answer for us is pretty much always Lustre
    problems.<br>
    <br>
    Ryan<br>
    <br>
    <div class="moz-cite-prefix">On 7/22/20 1:02 PM, Angelos Ching
      wrote:<br>
    </div>
    <blockquote type="cite"
      cite="mid:A6E5FB6B-F87C-4840-9981-8B5C5A4FEA01@clustertech.com">
      <meta http-equiv="content-type" content="text/html; charset=UTF-8">
      Agreed. You may also want to write a script that gathers the list
      of programs in "D state" (uninterruptible kernel wait) and prints
      their stacks, and configure it as UnkillableStepProgram so that
      you can capture the program and the relevant system calls that
      caused the job to become unkillable / time out while exiting, for
      further troubleshooting.
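      <div><br>
        A minimal sketch of such a script (treat it as a starting
        point: it assumes root privileges, a kernel that exposes
        /proc/PID/stack, and a log path chosen purely for
        illustration):</div>
      <pre>
#!/usr/bin/env python3
# Hypothetical UnkillableStepProgram: log every process stuck in D state
# (uninterruptible sleep) together with its kernel stack, so the syscall
# blocking job cleanup can be identified afterwards.
import os
import time

LOG = "/var/log/slurm/unkillable_steps.log"   # assumed location

def read(path):
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return ""

with open(LOG, "a") as out:
    out.write("=== %s job=%s step=%s ===\n" % (
        time.strftime("%Y-%m-%dT%H:%M:%S"),
        os.environ.get("SLURM_JOB_ID", "?"),    # may not be set by every Slurm version
        os.environ.get("SLURM_STEP_ID", "?")))
    for pid in filter(str.isdigit, os.listdir("/proc")):
        # The "State:" line of /proc/PID/status reads e.g. "State: D (disk sleep)"
        state = [line.split() for line in read("/proc/%s/status" % pid).splitlines()
                 if line.startswith("State:")]
        if state and state[0][1:2] == ["D"]:
            comm = read("/proc/%s/comm" % pid).strip()
            out.write("PID %s (%s) is in D state\n" % (pid, comm))
            out.write(read("/proc/%s/stack" % pid) + "\n")
      </pre>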
      <div><br>
        Regards,</div>
      <div>Angelos<br>
        <div dir="ltr">(Sent from mobile, please pardon me for typos and
          cursoriness.)</div>
        <div dir="ltr"><br>
          <blockquote type="cite">2020/07/23 0:41、Ryan Cox
            <a class="moz-txt-link-rfc2396E" href="mailto:ryan_cox@byu.edu"><ryan_cox@byu.edu></a>のメール:<br>
            <br>
          </blockquote>
        </div>
        <blockquote type="cite">
          <div dir="ltr">
            <meta http-equiv="Content-Type" content="text/html;
              charset=UTF-8">
            Ivan,<br>
            <br>
            Are you having I/O slowness? That is the most common cause
            for us. If it's not that, you'll want to look through all
            the reasons it can take a long time for a process to
            actually die after a SIGKILL, because one of those is the
            likely cause. Typically it's because the process is waiting
            for an I/O syscall to return. Sometimes swap death is the
            culprit, but usually not at the scale that you stated.
            Maybe you could try reproducing the issue manually or
            putting something in the epilog to see the state of the
            processes in the job's cgroup.<br>
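            <br>
            For example, a rough epilog sketch along these lines might
            help (the cgroup v1 freezer path and the log file below are
            assumptions for illustration; adjust them to your
            ProctrackType and cgroup layout):<br>
            <pre>
#!/usr/bin/env python3
# Hypothetical Slurm Epilog helper: record the state of any processes
# still present in the job's cgroup when the epilog runs.
# Assumes cgroup v1 with the freezer controller and Slurm's usual
# uid_N/job_N hierarchy; paths are illustrative only.
import glob
import os

job = os.environ.get("SLURM_JOB_ID", "")
uid = os.environ.get("SLURM_JOB_UID", "")
pattern = "/sys/fs/cgroup/freezer/slurm/uid_%s/job_%s/**/cgroup.procs" % (uid, job)

with open("/var/log/slurm/epilog_cgroup.log", "a") as out:   # assumed log path
    for procs_file in glob.glob(pattern, recursive=True):
        for pid in open(procs_file).read().split():
            try:
                with open("/proc/%s/status" % pid) as f:
                    status = f.read()
            except OSError:
                continue  # the process exited in the meantime
            state = [line for line in status.splitlines()
                     if line.startswith("State:")]
            out.write("job %s pid %s %s\n"
                      % (job, pid, state[0] if state else "?"))
            </pre>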
            <br>
            Ryan<br>
            <br>
            <div class="moz-cite-prefix">On 7/22/20 10:24 AM, Ivan
              Kovanda wrote:<br>
            </div>
            <blockquote type="cite"
cite="mid:MWHPR11MB006111D9FE2985B00DFC220BF7790@MWHPR11MB0061.namprd11.prod.outlook.com">
              <meta http-equiv="Content-Type" content="text/html;
                charset=UTF-8">
              <meta name="Generator" content="Microsoft Word 15
                (filtered medium)">
              <style><!--
/* Font Definitions */
@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in;
        margin-bottom:.0001pt;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:#0563C1;
        text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
        {mso-style-priority:99;
        color:#954F72;
        text-decoration:underline;}
span.EmailStyle17
        {mso-style-type:personal-compose;
        font-family:"Calibri",sans-serif;
        color:black;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-family:"Calibri",sans-serif;}</style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
              <div class="WordSection1">
                <p class="MsoNormal"><span style="color:black">Dear
                    slurm community,<o:p></o:p></span></p>
                <p class="MsoNormal"><span style="color:black"><o:p> </o:p></span></p>
                <p class="MsoNormal"><span style="color:black">Currently
                    running slurm version 18.08.4<o:p></o:p></span></p>
                <p class="MsoNormal"><span style="color:black"><o:p> </o:p></span></p>
                <p class="MsoNormal"><span style="color:black">We have
                    been experiencing an issue causing any nodes a slurm
                    job was submitted to to "drain".<o:p></o:p></span></p>
                <p class="MsoNormal"><span style="color:black">From what
                    I've seen, it appears that there is a problem with
                    how slurm is cleaning up the job with the SIGKILL
                    process.<o:p></o:p></span></p>
                <p class="MsoNormal"><span style="color:black"><o:p> </o:p></span></p>
                <p class="MsoNormal"><span style="color:black">I've
                    found this slurm article (<a
                      class="moz-txt-link-freetext"
                      href="https://slurm.schedmd.com/troubleshoot.html#completing"
                      moz-do-not-send="true">https://slurm.schedmd.com/troubleshoot.html#completing</a>)
                    , which has a section titled "Jobs and nodes are
                    stuck in COMPLETING state", where it recommends
                    increasing the "UnkillableStepTimeout" in the
                    slurm.conf , but all that has done is prolong the
                    time it takes for the job to timeout. <o:p></o:p></span></p>
                <p class="MsoNormal"><span style="color:black">The
                    default time for the "UnkillableStepTimeout" is 60
                    seconds.<o:p></o:p></span></p>
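                <p class="MsoNormal"><span style="color:black">For
                    reference, the setting looks roughly like this in
                    slurm.conf (the value below is only an example, not
                    a recommendation):</span></p>
                <pre>
# slurm.conf -- illustrative value only
UnkillableStepTimeout=120
                </pre>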
                <p class="MsoNormal"><span style="color:black"><o:p> </o:p></span></p>
                <p class="MsoNormal"><span style="color:black">After the
                    job completes, it stays in the CG (completing)
                    status for the 60 seconds, then the nodes the job
                    was submitted to go to drain status.<o:p></o:p></span></p>
                <p class="MsoNormal"><span style="color:black"><o:p> </o:p></span></p>
                <p class="MsoNormal"><span style="color:black">On the
                    headnode running slurmctld, I am seeing this in the
                    log - /var/log/slurmctld:<o:p></o:p></span></p>
                <p class="MsoNormal"><span style="color:black">--------------------------------------------------------------------------------------------------------------------------------------------<o:p></o:p></span></p>
                <p class="MsoNormal"><span style="color:black">[2020-07-21T22:40:03.000]
                    update_node: node node001 reason set to: Kill task
                    failed<o:p></o:p></span></p>
                <p class="MsoNormal"><span style="color:black">[2020-07-21T22:40:03.001]
                    update_node: node node001 state set to DRAINING<o:p></o:p></span></p>
                <p class="MsoNormal"><span style="color:black"><o:p> </o:p></span></p>
                <p class="MsoNormal"><span style="color:black">On the
                    compute node, I am seeing this in the log -
                    /var/log/slurmd<o:p></o:p></span></p>
                <p class="MsoNormal"><span style="color:black">--------------------------------------------------------------------------------------------------------------------------------------------<o:p></o:p></span></p>
                <p class="MsoNormal"><span style="color:black">[2020-07-21T22:38:33.110]
                    [1485.batch] done with job<o:p></o:p></span></p>
                <p class="MsoNormal"><span style="color:black">[2020-07-21T22:38:33.110]
                    [1485.extern] Sent signal 18 to 1485.4294967295<o:p></o:p></span></p>
                <p class="MsoNormal"><span style="color:black">[2020-07-21T22:38:33.111]
                    [1485.extern] Sent signal 15 to 1485.4294967295<o:p></o:p></span></p>
                <p class="MsoNormal"><span style="color:black">[2020-07-21T22:39:02.820]
                    [1485.extern] Sent SIGKILL signal to 1485.4294967295<o:p></o:p></span></p>
                <p class="MsoNormal"><span style="color:black">[2020-07-21T22:40:03.000]
                    [1485.extern] error: *** EXTERN STEP FOR 1485 STEPD
                    TERMINATED ON node001 AT 2020-07-21T22:40:02 DUE TO
                    JOB NOT ENDING WITH SIGNALS ***<o:p></o:p></span></p>
                <p class="MsoNormal"><span style="color:black"><o:p> </o:p></span></p>
                <p class="MsoNormal"><span style="color:black"><o:p> </o:p></span></p>
                <p class="MsoNormal"><span style="color:black">I've
                    tried restarting the SLURMD daemon on the compute
                    nodes, and even completing rebooting a few computes
                    nodes (node001, node002) . <o:p></o:p></span></p>
                <p class="MsoNormal"><span style="color:black">From what
                    I've seen were experiencing this on all nodes in the
                    cluster. <o:p></o:p></span></p>
                <p class="MsoNormal"><span style="color:black">I've yet
                    to restart the headnode because there are still
                    active jobs on the system so I don't want to
                    interrupt those.<o:p></o:p></span></p>
                <p class="MsoNormal"><span style="color:black"><o:p> </o:p></span></p>
                <p class="MsoNormal"><span style="color:black"><o:p> </o:p></span></p>
                <p class="MsoNormal"><span style="color:black">Thank you
                    for your time,<o:p></o:p></span></p>
                <p class="MsoNormal"><span style="color:black">Ivan<o:p></o:p></span></p>
                <p class="MsoNormal"><o:p> </o:p></p>
              </div>
            </blockquote>
          </div>
        </blockquote>
      </div>
    </blockquote>
    <br>
  </body>
</html>