<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <p>Hello Robert,<br>
    </p>
    <p>I've been having the same issue with BCM: CentOS 8.2, BCM 9.0,
      Slurm 20.02.3. It seems to have started when I enabled
      proctrack/cgroup and changed select/linear to select/cons_tres.</p>
    <p>Are you using cgroup process tracking and have you manipulated
      the cgroup.conf file? Do jobs complete correctly when not
      cancelled?</p>
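    <p>For context, here is roughly the combination of settings I
      changed to. The exact values are from memory and may not match
      your site, so treat this as a sketch rather than a
      recommendation:</p>
    <pre>
# slurm.conf
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory

# cgroup.conf (mostly defaults)
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
</pre>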
    <p>Regards,<br>
    </p>
    <div class="moz-signature">
      <meta http-equiv="content-type" content="text/html; charset=UTF-8">
      <title></title>
      <table cellspacing="0" cellpadding="0" border="0">
        <tbody>
          <tr>
            <td width="150" valign="top" height="30" align="left">
              <p style="font-size:14px;">Willy Markuske</p>
            </td>
          </tr>
          <tr>
            <td style="border-right: 1px solid #000;" align="left">
              <p style="font-size:12px;">HPC Systems Engineer</p>
            </td>
            <td rowspan="3" width="180" valign="center" height="42"
              align="center"><tt><img moz-do-not-send="false"
                  src="cid:part1.D4139E68.85EB4280@sdsc.edu" alt=""
                  width="168" height="48"></tt> </td>
          </tr>
          <tr>
            <td style="border-right: 1px solid #000;" align="left">
              <p style="font-size:12px;">Research Data Services</p>
            </td>
          </tr>
          <tr>
            <td style="border-right: 1px solid #000;" align="left">
              <p style="font-size:12px;">P: (858) 246-5593</p>
            </td>
          </tr>
        </tbody>
      </table>
    </div>
    <div class="moz-cite-prefix">On 11/30/20 10:54 AM, Alex Chekholko
      wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CANcy_Pbrbp9v=eztqBAM-FMJ1Wnt363CLe0=aQs7vaZLXeg0TA@mail.gmail.com">
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
      <div dir="ltr">This may be more "cargo cult" but I've advised
        users to add a "sleep 60" to the end of their job scripts if
        they are "I/O intensive".  Sometimes they are somehow able to
        generate I/O in a way that slurm thinks the job is finished, but
        the OS is still catching up on the I/O, and then slurm tries to
        kill the job...</div>
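      <div dir="ltr">Concretely, the tail of such a job script ends up
        looking something like this (the application line is just a
        placeholder, and the sync is an extra precaution on top of the
        sleep):
        <pre>
#!/bin/bash
#SBATCH --job-name=io_heavy

./my_io_intensive_app   # placeholder for the real workload

# Give the OS time to finish flushing buffered writes before the
# job exits, so slurm doesn't kill processes still stuck in I/O.
sync
sleep 60
</pre>
      </div>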
      <br>
      <div class="gmail_quote">
        <div dir="ltr" class="gmail_attr">On Mon, Nov 30, 2020 at 10:49
          AM Robert Kudyba <<a href="mailto:rkudyba@fordham.edu"
            moz-do-not-send="true">rkudyba@fordham.edu</a>> wrote:<br>
        </div>
        <blockquote class="gmail_quote" style="margin:0px 0px 0px
          0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
          <div dir="ltr">Sure I've seen that in some of the posts
            here, e.g., a NAS. But in this case it's a NFS share to the
            local RAID10 storage. There aren't any other settings that
            deal with this to not drain a node?</div>
          <br>
          <div class="gmail_quote">
            <div dir="ltr" class="gmail_attr">On Mon, Nov 30, 2020 at
              1:02 PM Paul Edmon <<a
                href="mailto:pedmon@cfa.harvard.edu" target="_blank"
                moz-do-not-send="true">pedmon@cfa.harvard.edu</a>>
              wrote:<br>
            </div>
            <blockquote class="gmail_quote" style="margin:0px 0px 0px
              0.8ex;border-left:1px solid
              rgb(204,204,204);padding-left:1ex">That can help.  Usually
              this happens due to laggy storage the job is <br>
              using taking time flushing the job's data.  So making sure
              that your <br>
              storage is up, responsive, and stable will also cut these
              down.<br>
              <br>
              -Paul Edmon-<br>
              <br>
              On 11/30/2020 12:52 PM, Robert Kudyba wrote:<br>
              > I've seen where this was a bug that was fixed <br>
              > <a href="https://bugs.schedmd.com/show_bug.cgi?id=3941"
                rel="noreferrer" target="_blank">https://bugs.schedmd.com/show_bug.cgi?id=3941</a><br>
              > but this happens <br>
              > occasionally still. A user cancels his/her job and a
              node gets <br>
              > drained. UnkillableStepTimeout=120 is set in
              slurm.conf<br>
              ><br>
              > Slurm 20.02.3 on CentOS 7.9 running on Bright Cluster
              8.2<br>
              ><br>
              > Slurm Job_id=6908 Name=run.sh Ended, Run time
              7-17:50:36, CANCELLED, <br>
              > ExitCode 0<br>
              > Resending TERMINATE_JOB request JobId=6908
              Nodelist=node001<br>
              > update_node: node node001 reason set to: Kill task
              failed<br>
              > update_node: node node001 state set to DRAINING<br>
              > error: slurmd error running JobId=6908 on
              node(s)=node001: Kill task <br>
              > failed<br>
              ><br>
              > update_node: node node001 reason set to: hung<br>
              > update_node: node node001 state set to DOWN<br>
              > update_node: node node001 state set to IDLE<br>
              > error: Nodes node001 not responding<br>
              ><br>
              > scontrol show config | grep kill<br>
              > UnkillableStepProgram   = (null)<br>
              > UnkillableStepTimeout   = 120 sec<br>
              ><br>
              > Do we just increase the timeout value?<br>
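              <br>
              For reference, both knobs live in slurm.conf; a minimal
              sketch, with an illustrative timeout value and a
              hypothetical debug-hook path:<br>
              <pre>
# slurm.conf -- how long slurmd waits after SIGKILL before
# declaring "Kill task failed" and draining the node
UnkillableStepTimeout=180

# Optional: run a hook for debugging when a step can't be killed
# (path is hypothetical)
UnkillableStepProgram=/usr/local/sbin/unkillable_debug.sh

# unkillable_debug.sh -- log processes stuck in uninterruptible (D)
# sleep, which usually points at hung I/O such as a stale NFS mount
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/' >> /var/log/slurm/unkillable.log
</pre>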
              <br>
            </blockquote>
          </div>
        </blockquote>
      </div>
    </blockquote>
  </body>
</html>