<div dir="ltr">In my most recent experience, I have some SSDs in compute nodes that occasionally just drop off the bus, so the compute node loses its OS disk.  I haven't looked into it closely, but the default NHC scripts do not notice that.  Similarly, Paul's proposed script might also need to check that the Slurm log file is readable.<div>The way I detect it myself is that a random swath of jobs fails, and then when I SSH to the node I get an I/O error instead of a regular connection.</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, May 4, 2021 at 9:41 AM Paul Edmon <<a href="mailto:pedmon@cfa.harvard.edu">pedmon@cfa.harvard.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Since you can run an arbitrary script as a node health checker I might <br>

add a script that counts failures and then closes if it hits a <br>

threshold.  The script shouldn't need to talk to the slurmctld or <br>

slurmdbd as it should be able to watch the log on the node and see the fail.<br>

<br>

-Paul Edmon-<br>

<br>

On 5/4/2021 12:09 PM, Gerhard Strangar wrote:<br>

> Hello,<br>

><br>

> how do you implement something like "drain host after 10 consecutive<br>

> failed jobs"? Unlike a host check script, that checks for known errors,<br>

> I'd like to stop killing jobs just because one node is faulty.<br>

><br>

> Gerhard<br>

><br>

<br>

</blockquote></div>
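<div dir="ltr">For what it's worth, here is a rough sketch of how the two ideas in this thread could be combined: count consecutive failures from the node-local log, and treat an unreadable log as a failure in its own right. Everything specific here is an assumption: the two regular expressions are placeholders and not real slurmd log messages, the log path and threshold are up to you, and the function names are made up for illustration. The only Slurm piece I'm relying on is the standard <code>scontrol update NodeName=... State=DRAIN Reason=...</code> form, which the sketch only builds and does not run.</div>

```python
import re
from typing import Iterable, List, Optional

# ASSUMPTION: placeholder patterns, not actual slurmd log messages.
# Replace them with whatever your slurmd log prints for job outcomes.
FAIL_RE = re.compile(r"job \d+ failed")
OK_RE = re.compile(r"job \d+ completed")


def consecutive_failures(lines: Iterable[str]) -> int:
    """Return the length of the run of failed jobs at the end of the log."""
    streak = 0
    for line in lines:
        if FAIL_RE.search(line):
            streak += 1
        elif OK_RE.search(line):
            # Any successful job resets the count, so only an
            # uninterrupted run of failures can reach the threshold.
            streak = 0
    return streak


def should_drain(log_path: str, threshold: int = 10) -> Optional[str]:
    """Return a drain reason, or None if the node looks healthy."""
    try:
        with open(log_path, "r", errors="replace") as fh:
            streak = consecutive_failures(fh)
    except OSError as exc:
        # An unreadable log (for example an I/O error after the OS disk
        # dropped off the bus) is reported as a failure by itself.
        return f"cannot read {log_path}: {exc.strerror}"
    if streak >= threshold:
        return f"{streak} consecutive failed jobs"
    return None


def drain_command(node: str, reason: str) -> List[str]:
    """Build (but do not run) the scontrol call that would drain the node."""
    return ["scontrol", "update", f"NodeName={node}",
            "State=DRAIN", f"Reason={reason}"]
```

<div dir="ltr">A wrapper would call <code>should_drain()</code> with the node's slurmd log path and, if it returns a reason, hand <code>drain_command()</code> to <code>subprocess.run</code>. One caveat: if the OS disk is really gone, the interpreter may not start at all, so whatever launches the check should probably treat "the check could not run" as a failure too.</div>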