[slurm-users] Slurm - UnkillableStepProgram
Chris Samuel
chris at csamuel.org
Tue Mar 23 04:30:01 UTC 2021
Hi Mike,
On 22/3/21 7:12 pm, Yap, Mike wrote:
> # I presume UnkillableStepTimeout is set in slurm.conf, and it acts as a
> timer to trigger UnkillableStepProgram
That is correct.
> # UnkillableStepProgram can be used to send email or reboot the compute node
> – question is how do we configure it?
Yes, or to automate collecting debug info (which is what we do). We then
manually intervene to reboot the node once we've determined there's no
more useful info to collect.
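As a rough illustration of the "collect debug info" approach, here is a minimal sketch of what such a script could look like. It is not the script we run: the report directory, the function name collect_debug_info, and the choice of what to record are all assumptions for the example. Whether SLURM_JOB_ID is set in the program's environment depends on your Slurm version, so the sketch falls back to "unknown".

```shell
#!/bin/bash
# Hypothetical UnkillableStepProgram sketch: record which processes are
# stuck and what the kernel has logged, so a human can decide what to do.

# Assumed report location; adjust to somewhere persistent on your nodes.
LOG_DIR="${UNKILLABLE_LOG_DIR:-${TMPDIR:-/tmp}/slurm-unkillable}"

collect_debug_info() {
    local out_dir="$1"
    mkdir -p "$out_dir" || return 1
    local report
    report="$out_dir/$(uname -n)-$(date +%Y%m%dT%H%M%S).log"
    {
        echo "host: $(uname -n)"
        # May be unset, depending on the Slurm version.
        echo "job: ${SLURM_JOB_ID:-unknown}"
        echo "== processes in uninterruptible sleep (state D) =="
        # Keep the header line plus any row whose STAT column starts with D.
        ps -eo pid,ppid,user,stat,wchan:32,cmd | awk 'NR == 1 || $4 ~ /^D/'
        echo "== last kernel messages =="
        dmesg 2>/dev/null | tail -n 50
    } > "$report" 2>&1
    # Print the report path so a caller can pick it up.
    echo "$report"
}

collect_debug_info "$LOG_DIR" > /dev/null
```

The script deliberately does not reboot or drain anything; it only writes a report, which matches the "collect first, intervene manually" workflow described above.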
As for configuring it, UnkillableStepProgram is just set in your slurm.conf:
UnkillableStepProgram=/path/to/the/unkillable/step/script.sh
Of course this script has to be present on every compute node.
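One way to check the "present on every compute node" requirement is to have each node read the path from its own slurm.conf and test that the file is executable. The following is only a sketch: the helper names get_unkillable_program and check_unkillable_program are made up for this example, and the slurm.conf location in the usage comment is an assumption that varies between installations.

```shell
#!/bin/bash
# Hypothetical helpers: find the configured UnkillableStepProgram in a
# slurm.conf-style file and verify that it exists and is executable.

get_unkillable_program() {
    local conf="$1"
    # Drop comments, then match the parameter name case-insensitively.
    # If the parameter appears more than once, report the last occurrence.
    sed -e 's/#.*//' "$conf" |
        awk -F= 'tolower($1) ~ /^[ \t]*unkillablestepprogram[ \t]*$/ {
                     sub(/^[ \t]+/, "", $2)
                     sub(/[ \t]+$/, "", $2)
                     print $2
                 }' |
        tail -n 1
}

check_unkillable_program() {
    local prog
    prog="$(get_unkillable_program "$1")"
    # Succeeds only if a path is configured and that path is executable.
    [ -n "$prog" ] && [ -x "$prog" ]
}

# Example usage (the path is an assumption; use your site's slurm.conf):
#   check_unkillable_program /etc/slurm/slurm.conf || echo "script missing on $(uname -n)"
```

Running this on each node with whatever parallel shell tool you already use would flag nodes where the script was never deployed.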
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA