[slurm-users] Draining hosts because of failing jobs

Tue May 4 16:47:45 UTC 2021

In my most recent experience, I have some SSDs in compute nodes that
occasionally just drop off the bus, so the compute node loses its OS disk.
I haven't thought about it too hard, but the default NHC scripts do not
notice that.  Similarly, Paul's proposed script might also need to check
that the slurm log file is readable.
The way I detect it myself is that a random swath of jobs fails, and then
when I SSH to the node I get an I/O error instead of a regular
connection.
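
A minimal sketch of such a readability check, in Python, could look like
the following.  The function name and the log path in the usage note are
my own placeholders, not anything NHC or Slurm provides, and a read that
is served from the page cache can still succeed after the disk is gone,
so treat this as a rough probe rather than a reliable detector.

```python
def path_is_readable(path, nbytes=4096):
    """Return True if the first bytes of `path` can actually be read.

    Hypothetical helper: a missing file, a permission problem, or an
    I/O error from a disk that dropped off the bus all count as
    "not readable".
    """
    try:
        with open(path, "rb") as fh:
            fh.read(nbytes)
        return True
    except OSError:
        # Covers FileNotFoundError, PermissionError and EIO alike.
        return False
```

A health check script could call path_is_readable() on whatever file
SlurmdLogFile points to in slurm.conf (for example
"/var/log/slurmd.log", which is an assumed path) and treat a False
result as a reason to take the node out of service.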

On Tue, May 4, 2021 at 9:41 AM Paul Edmon <pedmon at cfa.harvard.edu> wrote:

> Since you can run an arbitrary script as a node health checker I might
> add a script that counts failures and then closes if it hits a
> threshold.  The script shouldn't need to talk to the slurmctld or
> slurmdbd as it should be able to watch the log on the node and see the
> fail.
>
> -Paul Edmon-
>
> On 5/4/2021 12:09 PM, Gerhard Strangar wrote:
> > Hello,
> >
> > how do you implement something like "drain host after 10 consecutive
> > failed jobs"? Unlike a host check script, that checks for known errors,
> > I'd like to stop killing jobs just because one node is faulty.
> >
> > Gerhard
> >
>
>
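
For the counting approach Paul describes in the quoted message, the
threshold logic itself might be sketched as below.  This assumes the
script has already collected the exit codes of recent jobs on the node,
oldest first; how it gets them (for example by parsing the slurmd log) is
site-specific and not shown.  All function names are hypothetical.  The
scontrol arguments are the usual way to drain a node, but the command is
only built here, not executed.

```python
def trailing_failures(exit_codes):
    """Count consecutive non-zero exit codes at the end of the list."""
    count = 0
    for code in reversed(exit_codes):
        if code == 0:
            break
        count += 1
    return count


def should_drain(exit_codes, threshold=10):
    """True once the last `threshold` jobs in a row have all failed."""
    return trailing_failures(exit_codes) >= threshold


def drain_command(node, reason):
    """Build (but do not run) the scontrol call that drains `node`."""
    return [
        "scontrol", "update",
        f"NodeName={node}",
        "State=DRAIN",
        f"Reason={reason}",
    ]
```

A single successful job resets the count, which matches the "10
consecutive failed jobs" wording in Gerhard's question.  The list
returned by drain_command() could be passed to subprocess.run() once
should_drain() returns True.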