<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <p>

      <blockquote type="cite"> I haven't thought about it too hard, but

        the default NHC scripts do not notice that.  </blockquote>

    </p>

    <p>That's the problem with NHC and any other problem-checking
      script: you have to tell it which errors to check for. As new
      errors occur, those scripts inevitably grow longer. </p>
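    <p>For the dropped-disk case quoted below, one such extra check
      might look roughly like this. It is only a sketch in standalone
      Python, not an NHC check: the function names are made up for
      illustration, and it assumes that a forced write (fsync) to the
      affected filesystem fails with an I/O error once the disk is gone,
      which may depend on the kernel and filesystem.</p>
    <pre>
```python
import os
import tempfile


def filesystem_is_writable(directory: str) -> bool:
    """Return True if a small probe file can be written and fsynced in `directory`."""
    try:
        # The probe file is removed automatically when the context exits.
        with tempfile.NamedTemporaryFile(dir=directory, prefix=".health_probe_") as probe:
            probe.write(b"ok")
            probe.flush()
            # ASSUMPTION: fsync pushes the data to the device, so a disk that
            # has dropped off the bus shows up as an OSError here instead of
            # being hidden by the page cache.
            os.fsync(probe.fileno())
        return True
    except OSError:
        return False


def file_is_readable(path: str) -> bool:
    """Return True if `path` can be opened and its first byte read."""
    try:
        with open(path, "rb") as handle:
            handle.read(1)
        return True
    except OSError:
        return False
```
    </pre>
    <p>A health-check wrapper could call filesystem_is_writable("/")
      and file_is_readable() on the slurmd log path, and report the node
      as unhealthy if either returns False.</p>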

    <p>--<br>

      Prentice<br>

    </p>


    <div class="moz-cite-prefix">On 5/4/21 12:47 PM, Alex Chekholko

      wrote:<br>

    </div>

    <blockquote type="cite"

cite="mid:CANcy_Pb16r1y_0dobtjwZqZZHz3HL+YGhVG4Fha1F8k3UVQEnw@mail.gmail.com">


      <div dir="ltr">In my most recent experience, I have some SSDs in

        compute nodes that occasionally just drop off the bus, so the

        compute node loses its OS disk.  I haven't thought about it too

        hard, but the default NHC scripts do not notice that. 

        Similarly, Paul's proposed script might need to also check that

        the slurm log file is readable.

        <div>The way I detect it myself is that a random swath of jobs
          fails, and when I then SSH to the node I get an I/O error
          instead of a regular connection. </div>

      </div>

      <br>

      <div class="gmail_quote">

        <div dir="ltr" class="gmail_attr">On Tue, May 4, 2021 at 9:41 AM

          Paul Edmon &lt;<a href="mailto:pedmon@cfa.harvard.edu"
            moz-do-not-send="true">pedmon@cfa.harvard.edu</a>&gt; wrote:<br>

        </div>

        <blockquote class="gmail_quote" style="margin:0px 0px 0px

          0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Since

          you can run an arbitrary script as a node health checker I

          might <br>

          add a script that counts failures and then closes the node if it hits a

          <br>

          threshold.  The script shouldn't need to talk to the slurmctld

          or <br>

          slurmdbd as it should be able to watch the log on the node and

          see the fail.<br>

          <br>

          -Paul Edmon-<br>

          <br>

          On 5/4/2021 12:09 PM, Gerhard Strangar wrote:<br>

          > Hello,<br>

          ><br>

          > how do you implement something like "drain host after 10

          consecutive<br>

          > failed jobs"? Unlike a host check script, that checks for

          known errors,<br>

          > I'd like to stop killing jobs just because one node is

          faulty.<br>

          ><br>

          > Gerhard<br>

          ><br>

          <br>

        </blockquote>

      </div>

    </blockquote>
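    <p>Paul's suggestion quoted above (count failures in the log on the
      node and act once a threshold is hit) might be sketched roughly as
      follows. This is not a tested health checker: the two regular
      expressions are placeholders rather than the real slurmd log
      format, the function names are hypothetical, and the default
      threshold of 10 is simply taken from Gerhard's question.</p>
    <pre>
```python
import re
from typing import Iterable, List

# ASSUMPTION: placeholder patterns, not the actual slurmd log format.
# Replace them with whatever your slurmd log prints for job outcomes.
FAILURE_PATTERN = re.compile(r"job \d+ failed")
SUCCESS_PATTERN = re.compile(r"job \d+ completed")


def consecutive_failures(lines: Iterable[str]) -> int:
    """Count the failures seen since the most recent success."""
    streak = 0
    for line in lines:
        if FAILURE_PATTERN.search(line):
            streak += 1
        elif SUCCESS_PATTERN.search(line):
            streak = 0  # any successful job breaks the streak
    return streak


def should_drain(lines: Iterable[str], threshold: int = 10) -> bool:
    """Return True once the current failure streak reaches `threshold`."""
    return consecutive_failures(lines) >= threshold


def drain_command(node: str, reason: str) -> List[str]:
    """Build (but do not run) an scontrol command that drains `node`."""
    return ["scontrol", "update", f"NodeName={node}", "State=DRAIN", f"Reason={reason}"]
```
    </pre>
    <p>A wrapper could read the node's slurmd log, call should_drain()
      on its lines, and pass the result of drain_command() to
      subprocess.run() when it returns True. Resetting the streak on any
      success is what makes the count "consecutive" rather than a running
      total.</p>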

  </body>

</html>