On 22-10-2024 16:46, Paul Raines via slurm-users wrote:
I have a cron job that emails me when hosts go into drain mode and tells me the reason (scontrol show node=$host | grep -i reason)
Instead of cron you can also use Slurm triggers. For example, see our scripts at https://github.com/OleHolmNielsen/Slurm_tools/tree/master/triggers and tailor the triggers to do whatever tasks you need.
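As a rough sketch of the trigger approach (this is not one of the scripts from the repository above; the function names and the mail recipient are made up for illustration), a trigger program could look something like this. It assumes that Slurm passes the affected node list as the first argument, and that a working `mail` command exists on the host running slurmctld:

```shell
#!/bin/bash
# Hypothetical trigger program: mail the drain reason for each drained node.
# Assumption: the node list arrives as $1, possibly as a hostlist
# expression such as "node[01-03]".

# Extract the first Reason=... field from "scontrol show node" output on stdin.
drain_reason() {
    grep -i -o 'Reason=.*' | head -n 1
}

# Expand the hostlist and print one "host: Reason=..." line per node.
report_nodes() {
    local host
    for host in $(scontrol show hostnames "$1"); do
        echo "$host: $(scontrol show node "$host" | drain_reason)"
    done
}

# Only act when a node list was actually passed in.
if [ -n "${1:-}" ]; then
    # "root" is a placeholder recipient.
    report_nodes "$1" | mail -s "Slurm nodes drained" root
fi
```

Such a script would be registered by the SlurmUser with something like `strigger --set --node --drained --flags=PERM --program=/usr/local/sbin/notify_drained.sh`, where the path is only an example and must be absolute.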
We get drains with the "Kill task failed" reason probably about 5 times a day. This despite having UnkillableStepTimeout=180
Some time ago it was recommended not to use UnkillableStepTimeout values above 127 (or 256?); see https://support.schedmd.com/show_bug.cgi?id=11103 for details. I don't know whether this restriction still applies to recent versions of Slurm.
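To see which value a cluster is actually running with, the setting can be read from `scontrol show config`. The helper names below are hypothetical, and the default limit of 127 is only the figure recalled above, not a confirmed bound for current Slurm versions:

```shell
#!/bin/bash
# Print the UnkillableStepTimeout value from "scontrol show config" output
# on stdin. Assumes lines of the form "UnkillableStepTimeout   = 180 sec".
unkillable_timeout() {
    awk -F'[ =]+' '$1 == "UnkillableStepTimeout" { print $2 }'
}

# Compare a value against a limit (default 127, an assumed threshold).
# Returns non-zero when the value exceeds the limit.
check_timeout() {
    local value=$1 limit=${2:-127}
    if [ "$value" -gt "$limit" ]; then
        echo "UnkillableStepTimeout=$value exceeds $limit"
        return 1
    fi
    echo "UnkillableStepTimeout=$value is within $limit"
}

# Example use on a Slurm host:
#   check_timeout "$(scontrol show config | unkillable_timeout)"
```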
Best regards, Ole