<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <p>Adrian and Diego,</p>
    <p>Are you using AMD Epyc processors when you see this issue? I've
      been having the same problem, but only on dual AMD Epyc systems. I
      haven't tried moving the core file location off an NFS mount,
      though, so perhaps the core isn't being written out in time.</p>
    <p>How did you disable core files?</p>
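    <p>(For reference, here is a minimal sketch of the generic Linux
      knobs for disabling core files; this assumes a standard setup and
      is not necessarily what you did:)</p>
```shell
# Hedged sketch: generic Linux mechanisms for disabling core dumps.
# Assumes a standard Linux compute node; adjust for your site.

# Per-shell (inherited by child processes): zero the core size limit.
ulimit -c 0

# Persistent, system-wide, via /etc/security/limits.conf:
#   *  soft  core  0
#   *  hard  core  0

# Confirm the limit is in effect (prints 0 when cores are disabled).
ulimit -c
```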
    <p>Regards,<br>
    </p>
    <div class="moz-signature">
      
      <table cellspacing="0" cellpadding="0" border="0">
        <tbody>
          <tr>
            <td width="150" valign="top" height="30" align="left">
              <p style="font-size:14px;">Willy Markuske</p>
            </td>
          </tr>
          <tr>
            <td style="border-right: 1px solid #000;" align="left">
              <p style="font-size:12px;">HPC Systems Engineer</p>
            </td>
            <td rowspan="3" width="180" valign="middle" height="42" align="center"><tt><img moz-do-not-send="false" src="cid:part1.740BAA87.EF586C76@sdsc.edu" alt="" width="168" height="48"></tt> </td>
          </tr>
          <tr>
            <td style="border-right: 1px solid #000;" align="left">
              <p style="font-size:12px;">Research Data Services</p>
            </td>
          </tr>
          <tr>
            <td style="border-right: 1px solid #000;" align="left">
              <p style="font-size:12px;">P: (619) 519-4435</p>
            </td>
          </tr>
        </tbody>
      </table>
      <p> </p>
    </div>
    <div class="moz-cite-prefix">On 8/6/21 6:16 AM, Adrian Sevcenco
      wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:b6dd2426-b4d4-19e2-4a45-7aeabc407451@spacescience.ro">On
      8/6/21 3:19 PM, Diego Zuccato wrote:
      <br>
      <blockquote type="cite">IIRC we increased SlurmdTimeout to 7200.
        <br>
      </blockquote>
      Thanks a lot!
      <br>
      <br>
      Adrian
      <br>
      <br>
      <blockquote type="cite">
        <br>
        On 06/08/2021 13:33, Adrian Sevcenco wrote:
        <br>
        <blockquote type="cite">On 8/6/21 1:56 PM, Diego Zuccato wrote:
          <br>
          <blockquote type="cite">We had a similar problem some time ago
            (slow creation of big core files) and solved it by
            increasing the Slurm timeouts
            <br>
          </blockquote>
          Oh, I see... well, in principle I should not have core files,
          and I do not find any...
          <br>
          <br>
          <blockquote type="cite">to the point that even the slowest
            core wouldn't trigger it. Then, once the need for core files
            was over, I disabled core files and restored the timeouts.
            <br>
          </blockquote>
          And how much did you increase them? I have
          <br>
          SlurmctldTimeout=300
          <br>
          SlurmdTimeout=300
          <br>
          <br>
          Thank you!
          <br>
          Adrian
          <br>
          <br>
          <br>
          <blockquote type="cite">
            <br>
            On 06/08/2021 12:46, Adrian Sevcenco wrote:
            <br>
            <blockquote type="cite">On 8/6/21 1:27 PM, Diego Zuccato
              wrote:
              <br>
              <blockquote type="cite">Hi.
                <br>
              </blockquote>
              Hi!
              <br>
              <br>
              <blockquote type="cite">Might it be due to a timeout
                (maybe the killed job is creating a core file, or caused
                heavy swap usage)?
                <br>
              </blockquote>
              I will have to search for the culprit...
              <br>
              The problem is: why would the node be put in drain because
              killing a task failed? And how can I control/disable this?
              <br>
              <br>
              Thank you!
              <br>
              Adrian
              <br>
              <br>
              <br>
              <blockquote type="cite">
                <br>
                BYtE,
                <br>
                  Diego
                <br>
                <br>
                On 06/08/2021 09:02, Adrian Sevcenco wrote:
                <br>
                <blockquote type="cite">Having just implemented some
                  triggers, I noticed this:
                  <br>
                  <br>
                  NODELIST    NODES PARTITION       STATE CPUS    S:C:T
                  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
                  <br>
                  alien-0-47      1    alien*    draining   48   48:1:1
                  193324 214030      1 rack-0,4 Kill task failed
                  <br>
                  alien-0-56      1    alien*     drained   48   48:1:1
                  193324 214030      1 rack-0,4 Kill task failed
                  <br>
                  <br>
                  I was wondering why a node is drained when killing a
                  task fails, and how can I disable it? (I use cgroups.)
                  <br>
                  Moreover, how can the killing of a task fail? (This is
                  on Slurm 19.05.)
                  <br>
                  <br>
                  Thank you!
                  <br>
                  Adrian
                  <br>
                  <br>
                </blockquote>
              </blockquote>
            </blockquote>
          </blockquote>
        </blockquote>
      </blockquote>
      <br>
      <br>
    </blockquote>
  </body>
</html>