[slurm-users] [External] Re: Draining hosts because of failing jobs

Tue May 4 19:44:20 UTC 2021

> I haven't thought about it too hard, but the default NHC scripts do 
> not notice that. 

That's the problem with NHC and any other problem-checking script: You 
have to tell them what errors to check for. As new errors occur, those 
scripts inevitably grow longer.
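
For the original question (drain after N consecutive failed jobs), here is
a minimal Python sketch of the log-watching approach Paul suggests below.
It is an illustration only, and several things in it are assumptions that
do not come from this thread: the log path, the two regular expressions
for "failed" and "completed" lines, and the function names are all
placeholders that you would adapt to whatever your slurmd log or epilog
actually writes. Following Alex's point, a log that cannot be read is
treated as a reason to drain.

```python
#!/usr/bin/env python3
"""Sketch: drain a node after N consecutive failed jobs seen in a local log."""
import re
import socket
import subprocess

# ASSUMPTIONS (not from the thread): path, line patterns, and threshold.
LOG_PATH = "/var/log/slurm/slurmd.log"     # hypothetical location
FAIL_RE = re.compile(r"job \d+ failed")    # hypothetical "job failed" line
OK_RE = re.compile(r"job \d+ completed")   # hypothetical "job succeeded" line
THRESHOLD = 10


def consecutive_failures(lines):
    """Return the number of failures seen since the most recent success."""
    streak = 0
    for line in lines:
        if OK_RE.search(line):
            streak = 0           # any success resets the streak
        elif FAIL_RE.search(line):
            streak += 1
    return streak


def should_drain(log_path, threshold=THRESHOLD):
    """Return (drain, reason). A missing or unreadable log counts as a fault."""
    try:
        with open(log_path, errors="replace") as fh:
            streak = consecutive_failures(fh)
    except OSError as exc:
        return True, f"cannot read {log_path}: {exc.strerror}"
    if streak >= threshold:
        return True, f"{streak} consecutive failed jobs"
    return False, ""


def drain_command(node, reason):
    """Build the scontrol call that puts the node into DRAIN state."""
    return ["scontrol", "update", f"NodeName={node}",
            "State=DRAIN", f"Reason={reason}"]


def main():
    """Check the local log once and drain this node if needed."""
    drain, reason = should_drain(LOG_PATH)
    if drain:
        node = socket.gethostname().split(".")[0]   # assumes short hostname = node name
        subprocess.run(drain_command(node, reason), check=False)
        return 1
    return 0
```

To use it, call main() at the end of the script and run it periodically
on each node, for example as the HealthCheckProgram in slurm.conf. Note
that the streak is recomputed from the whole log on every run, so a large
log may call for remembering the file offset between runs.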

--
Prentice

On 5/4/21 12:47 PM, Alex Chekholko wrote:
> In my most recent experience, I have some SSDs in compute nodes that 
> occasionally just drop off the bus, so the compute node loses its OS 
> disk.  I haven't thought about it too hard, but the default NHC 
> scripts do not notice that. Similarly, Paul's proposed script might 
> need to also check that the slurm log file is readable.
> The way I detect it myself is when a random swath of jobs fails and 
> then when I SSH to the node and get an I/O error instead of a regular 
> connection.
>
> On Tue, May 4, 2021 at 9:41 AM Paul Edmon <pedmon at cfa.harvard.edu>
> wrote:
>
>     Since you can run an arbitrary script as a node health checker I
>     might add a script that counts failures and then closes if it hits
>     a threshold.  The script shouldn't need to talk to the slurmctld or
>     slurmdbd as it should be able to watch the log on the node and see
>     the fail.
>
>     -Paul Edmon-
>
>     On 5/4/2021 12:09 PM, Gerhard Strangar wrote:
>     > Hello,
>     >
>     > how do you implement something like "drain host after 10 consecutive
>     > failed jobs"? Unlike a host check script, that checks for known errors,
>     > I'd like to stop killing jobs just because one node is faulty.
>     >
>     > Gerhard
>     >
>