<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <p>Adrian and Diego,</p>
    <p>Are you using AMD Epyc processors when you see this issue? I've
      been having the same problem, but only on dual AMD Epyc systems. I
      haven't tried moving the core file location off an NFS mount,
      though, so perhaps the core isn't being written out in time.</p>
    <p>How did you disable core files?</p>
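    <p>(For reference, here is a minimal sketch of the generic Linux
      knobs for disabling core files; this assumes a standard setup and
      is not necessarily what you did:)</p>
```shell
# Hedged sketch: generic Linux mechanisms for disabling core dumps.
# Assumes a standard Linux compute node; adjust for your site.

# Per-shell (inherited by child processes): zero the core size limit.
ulimit -c 0

# Persistent, system-wide, via /etc/security/limits.conf:
#   *  soft  core  0
#   *  hard  core  0

# Confirm the limit is in effect (prints 0 when cores are disabled).
ulimit -c
```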
    <p>Regards,<br>
    </p>
    <div class="moz-signature">
      
      <table cellspacing="0" cellpadding="0" border="0">
        <tbody>
          <tr>
            <td width="150" valign="top" height="30" align="left">
              <p style="font-size:14px;">Willy Markuske</p>
            </td>
          </tr>
          <tr>
            <td style="border-right: 1px solid #000;" align="left">
              <p style="font-size:12px;">HPC Systems Engineer</p>
            </td>
            <td rowspan="3" width="180" valign="middle" height="42" align="center"><tt><img moz-do-not-send="false" src="cid:part1.740BAA87.EF586C76@sdsc.edu" alt="" width="168" height="48"></tt> </td>
          </tr>
          <tr>
            <td style="border-right: 1px solid #000;" align="left">
              <p style="font-size:12px;">Research Data Services</p>
            </td>
          </tr>
          <tr>
            <td style="border-right: 1px solid #000;" align="left">
              <p style="font-size:12px;">P: (619) 519-4435</p>
            </td>
          </tr>
        </tbody>
      </table>
      <p> </p>
    </div>
    <div class="moz-cite-prefix">On 8/6/21 6:16 AM, Adrian Sevcenco
      wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:b6dd2426-b4d4-19e2-4a45-7aeabc407451@spacescience.ro">On
      8/6/21 3:19 PM, Diego Zuccato wrote:
      <br>
      <blockquote type="cite">IIRC we increased SlurmdTimeout to 7200.
        <br>
      </blockquote>
      Thanks a lot!
      <br>
      <br>
      Adrian
      <br>
      <br>
      <blockquote type="cite">
        <br>
        On 06/08/2021 13:33, Adrian Sevcenco wrote:
        <br>
        <blockquote type="cite">On 8/6/21 1:56 PM, Diego Zuccato wrote:
          <br>
          <blockquote type="cite">We had a similar problem some time ago
            (slow creation of big core files) and solved it by
            increasing the Slurm timeouts
            <br>
          </blockquote>
          Oh, I see... well, in principle I should not have core files,
          and I do not find any...
          <br>
          <br>
          <blockquote type="cite">to the point that even the slowest
            core wouldn't trigger it. Then, once the need for core files
            was over, I disabled core files and restored the timeouts.
            <br>
          </blockquote>
          And how much did you increase them? I have
          <br>
          SlurmctldTimeout=300
          <br>
          SlurmdTimeout=300
          <br>
          <br>
          Thank you!
          <br>
          Adrian
          <br>
          <br>
          <br>
          <blockquote type="cite">
            <br>
            On 06/08/2021 12:46, Adrian Sevcenco wrote:
            <br>
            <blockquote type="cite">On 8/6/21 1:27 PM, Diego Zuccato
              wrote:
              <br>
              <blockquote type="cite">Hi.
                <br>
              </blockquote>
              Hi!
              <br>
              <br>
              <blockquote type="cite">Might it be due to a timeout
                (maybe the killed job is creating a core file, or caused
                heavy swap usage)?
                <br>
              </blockquote>
              I will have to search for the culprit...
              <br>
              The problem is: why would the node be put in drain because
              killing a task failed? And how can I control/disable this?
              <br>
              <br>
              Thank you!
              <br>
              Adrian
              <br>
              <br>
              <br>
              <blockquote type="cite">
                <br>
                BYtE,
                <br>
                  Diego
                <br>
                <br>
                On 06/08/2021 09:02, Adrian Sevcenco wrote:
                <br>
                <blockquote type="cite">Having just implemented some
                  triggers, I noticed this:
                  <br>
                  <br>
                  NODELIST    NODES PARTITION       STATE CPUS    S:C:T
                  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
                  <br>
                  alien-0-47      1    alien*    draining   48   48:1:1
                  193324 214030      1 rack-0,4 Kill task failed
                  <br>
                  alien-0-56      1    alien*     drained   48   48:1:1
                  193324 214030      1 rack-0,4 Kill task failed
                  <br>
                  <br>
                  I was wondering why a node is drained when killing a
                  task fails, and how can I disable it? (I use cgroups.)
                  <br>
                  Moreover, how can the killing of a task fail? (This is
                  on Slurm 19.05.)
                  <br>
                  <br>
                  Thank you!
                  <br>
                  Adrian
                  <br>
                  <br>
                </blockquote>
              </blockquote>
            </blockquote>
          </blockquote>
        </blockquote>
      </blockquote>
      <br>
      <br>
    </blockquote>
  </body>
</html>