<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>
<blockquote type="cite"> I haven't thought about it too hard, but
the default NHC scripts do not notice that. </blockquote>
</p>
    <p>That's the problem with NHC and any other problem-checking
      script: you have to tell it what errors to check for. As new
      errors occur, those scripts inevitably grow longer. </p>
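    <p>For the dropped-disk case quoted below, a custom check could be
      as small as the following Python sketch. It is only an
      illustration: the function names are made up, the paths to test
      are whatever you pass in, and a read served from the page cache
      may still succeed after the device is gone, so treat a pass as
      weak evidence. </p>
<pre>
```python
def check_readable(path, nbytes=4096):
    """Try to read the first nbytes of path; return (ok, message)."""
    try:
        with open(path, "rb") as fh:
            fh.read(nbytes)
    except OSError as exc:
        # Covers I/O errors from a vanished device as well as missing files.
        return False, f"{path}: {exc.strerror or exc}"
    return True, f"{path}: ok"


def run_checks(paths):
    """Return one error message for every path that could not be read."""
    failures = []
    for path in paths:
        ok, message = check_readable(path)
        if not ok:
            failures.append(message)
    return failures
```
</pre>
    <p>Calling <code>run_checks(["/etc/os-release",
      "/var/log/slurm/slurmd.log"])</code> (both paths are assumptions
      about your install) returns one message per unreadable path.
      What to do with a non-empty list is left to the calling
      health-check script. </p>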
<p>--<br>
Prentice<br>
</p>
<div class="moz-cite-prefix">On 5/4/21 12:47 PM, Alex Chekholko
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CANcy_Pb16r1y_0dobtjwZqZZHz3HL+YGhVG4Fha1F8k3UVQEnw@mail.gmail.com">
<div dir="ltr">In my most recent experience, I have some SSDs in
compute nodes that occasionally just drop off the bus, so the
compute node loses its OS disk. I haven't thought about it too
hard, but the default NHC scripts do not notice that.
        Similarly, Paul's proposed script might also need to check that
        the Slurm log file is readable.
        <div>The way I detect it myself is that a random swath of jobs
          fails, and then when I SSH to the node I get an I/O error
          instead of a regular connection. </div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Tue, May 4, 2021 at 9:41 AM
          Paul Edmon &lt;<a href="mailto:pedmon@cfa.harvard.edu"
            moz-do-not-send="true">pedmon@cfa.harvard.edu</a>&gt; wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Since
          you can run an arbitrary script as a node health checker, I
          might <br>
          add a script that counts failures and then closes the node if
          it hits a <br>
          threshold. The script shouldn't need to talk to the slurmctld
          or <br>
          slurmdbd, as it should be able to watch the log on the node and
          see the failures.<br>
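          <br>
          A minimal Python sketch of such a counter, for illustration
          only. The two regular expressions are hypothetical
          placeholders, not the real slurmd log format, and must be
          adapted to what your log actually prints. The default
          threshold of 10 comes from the question quoted below.<br>
<pre>
```python
import re

# Hypothetical patterns: replace with what your slurmd log really contains.
FAIL_RE = re.compile(r"job \d+ failed")
OK_RE = re.compile(r"job \d+ completed")


def consecutive_failures(lines, fail_re=FAIL_RE, ok_re=OK_RE):
    """Return the length of the failure streak at the end of the log."""
    streak = 0
    for line in lines:
        if fail_re.search(line):
            streak += 1
        elif ok_re.search(line):
            # Any successful job resets the streak.
            streak = 0
    return streak


def should_drain(lines, threshold=10):
    """True once the trailing failure streak reaches the threshold."""
    return consecutive_failures(lines) >= threshold
```
</pre>
          When should_drain() returns True, the health-check script
          could close the node itself, for example with scontrol update
          NodeName=... State=DRAIN Reason=...<br>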
<br>
-Paul Edmon-<br>
<br>
On 5/4/2021 12:09 PM, Gerhard Strangar wrote:<br>
> Hello,<br>
><br>
> how do you implement something like "drain host after 10
consecutive<br>
> failed jobs"? Unlike a host check script, that checks for
known errors,<br>
> I'd like to stop killing jobs just because one node is
faulty.<br>
><br>
> Gerhard<br>
><br>
<br>
</blockquote>
</div>
</blockquote>
</body>
</html>