[slurm-users] Draining hosts because of failing jobs

Tue May 4 16:40:00 UTC 2021

Since you can run an arbitrary script as a node health checker, I might
add a script that counts job failures and then drains the node once the
count hits a threshold.  The script shouldn't need to talk to slurmctld
or slurmdbd to do the counting, as it should be able to watch the slurmd
log on the node and see the failures there.
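
A minimal sketch of such a counter, in Python, might look like the
following. It is an illustration under stated assumptions, not a tested
health check: the log pattern (STATUS_RE), the function names, and the
idea that each finished job leaves one log line containing
"status:<exit code>" are all hypothetical and need to be checked against
what your slurmd actually writes. Only the final drain step uses a real
command, "scontrol update NodeName=... State=DRAIN Reason=...".

```python
import re
import subprocess

# Hypothetical pattern: assumes every finished job leaves one log line
# containing "status:<exit code>". Check your own slurmd log and adjust.
STATUS_RE = re.compile(r"status:(\d+)")

# Consecutive failures tolerated before draining (10, as in the question).
THRESHOLD = 10


def trailing_failures(lines, status_re=STATUS_RE):
    """Count the failed jobs seen since the last successful one."""
    streak = 0
    for line in lines:
        match = status_re.search(line)
        if match is None:
            continue  # not a job-completion line
        # A zero status resets the streak; anything else extends it.
        streak = 0 if int(match.group(1)) == 0 else streak + 1
    return streak


def should_drain(lines, threshold=THRESHOLD):
    """True once the current failure streak has reached the threshold."""
    return trailing_failures(lines) >= threshold


def drain(node, reason):
    """Drain the node with scontrol; raises if the command fails."""
    subprocess.run(
        ["scontrol", "update", f"NodeName={node}",
         "State=DRAIN", f"Reason={reason}"],
        check=True,
    )


def check_node(log_path, node):
    """Read the log at log_path and drain the node if the streak is too long."""
    with open(log_path, errors="replace") as log:
        if should_drain(log):
            drain(node, f"{THRESHOLD} consecutive failed jobs")
```

A script like this could call check_node() with the node's slurmd log
path and hostname and be run via HealthCheckProgram in slurm.conf. The
counting is purely local; only the scontrol call at the end contacts
slurmctld. Since old failures stay in the log, a real version would also
need some way to ignore entries from before the node was last resumed.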

-Paul Edmon-

On 5/4/2021 12:09 PM, Gerhard Strangar wrote:
> Hello,
>
> How do you implement something like "drain host after 10 consecutive
> failed jobs"? Unlike a host check script that checks for known errors,
> I'd like something that stops jobs from being killed just because one
> node is faulty.
>
> Gerhard
>