[slurm-users] [External] Re: Draining hosts because of failing jobs
pbisbal at pppl.gov
Tue May 4 19:44:20 UTC 2021
> I haven't thought about it too hard, but the default NHC scripts do
> not notice that.
That's the problem with NHC and any other problem-checking script: You
have to tell them what errors to check for. As new errors occur, those
scripts inevitably grow longer.
On 5/4/21 12:47 PM, Alex Chekholko wrote:
> In my most recent experience, I have some SSDs in compute nodes that
> occasionally just drop off the bus, so the compute node loses its OS
> disk. I haven't thought about it too hard, but the default NHC
> scripts do not notice that. Similarly, Paul's proposed script might
> need to also check that the slurm log file is readable.
> The way I detect it myself is when a random swath of jobs fails and
> then when I SSH to the node and get an I/O error instead of a regular
> On Tue, May 4, 2021 at 9:41 AM Paul Edmon <pedmon at cfa.harvard.edu> wrote:
> Since you can run an arbitrary script as a node health checker, I
> would add a script that counts failures and then closes the node if
> it hits a threshold. The script shouldn't need to talk to the
> slurmctld or slurmdbd to find the failures, as it should be able to
> watch the log on the node and see them there.
> -Paul Edmon-
> On 5/4/2021 12:09 PM, Gerhard Strangar wrote:
> > Hello,
> > how do you implement something like "drain host after 10 consecutive
> > failed jobs"? Unlike a host check script, that checks for known
> > I'd like to stop killing jobs just because one node is faulty.
> > Gerhard
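
For illustration only, here is a minimal sketch of the log-watching approach suggested in the thread: count the current run of failed jobs in a node-local log and drain the node once a threshold is reached. The log path, the two regular expressions, and the assumption that the Slurm NodeName equals the short hostname are all hypothetical and would need to be adapted to the real slurmd log format at a given site. The only Slurm command used is `scontrol update NodeName=... State=DRAIN Reason=...`. An unreadable log is treated as a failure in its own right, which loosely covers the "OS disk dropped off the bus" case mentioned above.

```python
#!/usr/bin/env python3
"""Sketch: drain this node after N consecutive failed jobs seen in a local log."""
import re
import socket
import subprocess
import sys

# ASSUMPTIONS (not from the thread): path, patterns, and threshold are placeholders.
LOG_PATH = "/var/log/slurm/slurmd.log"      # hypothetical; depends on site config
FAIL_RE = re.compile(r"job \d+ failed")     # hypothetical failure pattern
OK_RE = re.compile(r"job \d+ completed")    # hypothetical success pattern
THRESHOLD = 10                              # "10 consecutive failed jobs"


def consecutive_failures(lines, fail_re=FAIL_RE, ok_re=OK_RE):
    """Return the length of the run of failures at the end of the log."""
    streak = 0
    for line in lines:
        if fail_re.search(line):
            streak += 1
        elif ok_re.search(line):
            streak = 0  # a successful job breaks the run
    return streak


def drain(node, reason):
    """Ask Slurm to drain the node; raises if scontrol fails."""
    subprocess.run(
        ["scontrol", "update", f"NodeName={node}",
         "State=DRAIN", f"Reason={reason}"],
        check=True,
    )


def main():
    # Assumes the Slurm NodeName matches the short hostname.
    node = socket.gethostname().split(".")[0]
    try:
        with open(LOG_PATH, errors="replace") as fh:
            lines = fh.readlines()
    except OSError as exc:
        # An unreadable log is itself a sign that the node is unhealthy.
        print(f"cannot read {LOG_PATH}: {exc}", file=sys.stderr)
        drain(node, "slurmd log unreadable")
        return 2
    streak = consecutive_failures(lines)
    if streak >= THRESHOLD:
        drain(node, f"{streak} consecutive failed jobs")
        return 1
    return 0
```

To use it as a health check, one would call `sys.exit(main())` under an `if __name__ == "__main__":` guard and point the health-check hook at the script. Reading the whole log on every run is the simplest thing that works; a production version would more likely remember its last file offset.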