[slurm-users] Slurm - UnkillableStepProgram

Chris Samuel chris at csamuel.org
Tue Mar 23 04:30:01 UTC 2021


Hi Mike,

On 22/3/21 7:12 pm, Yap, Mike wrote:

> # I presume UnkillableStepTimeout is set in slurm.conf, and it acts as a 
> timer to trigger UnkillableStepProgram

That is correct.

> # UnkillableStepProgram can be used to send email or reboot the compute 
> node; the question is, how do we configure it?

It can also be used to automate collecting debug info (which is what we 
do); we then manually intervene to reboot the node once we've determined 
there's no more useful info to collect.

It's just configured in your slurm.conf.

UnkillableStepProgram=/path/to/the/unkillable/step/script.sh

Of course this script has to be present on every compute node.
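
A minimal sketch of what such a script might look like, for the "collect 
debug info" case. This is an illustration only, not the script used at any 
particular site. The log directory, the UNKILLABLE_LOG_DIR override and the 
report file name are hypothetical, and it assumes slurmd sets SLURM_JOB_ID 
and SLURM_STEP_ID in the script's environment (it falls back to "unknown" 
if they are absent).

```shell
#!/bin/bash
# Hypothetical UnkillableStepProgram: save debug info for later inspection.

# Assumption: slurmd exports these for the affected step; fall back if not.
job="${SLURM_JOB_ID:-unknown}"
step="${SLURM_STEP_ID:-unknown}"

# Hypothetical log location; override by setting UNKILLABLE_LOG_DIR.
log_dir="${UNKILLABLE_LOG_DIR:-/tmp/unkillable-step}"
mkdir -p "$log_dir"
report="$log_dir/$(uname -n)-job${job}-step${step}-$(date +%Y%m%dT%H%M%S).log"

{
    echo "Unkillable step: job=${job} step=${step} host=$(uname -n) time=$(date)"
    echo "--- processes in uninterruptible sleep (state D) ---"
    # Keep the header line plus any process whose state starts with D.
    ps -eo pid,ppid,stat,wchan:32,user,etime,args | awk 'NR == 1 || $3 ~ /^D/'
    echo "--- last 100 kernel log lines ---"
    dmesg 2>/dev/null | tail -n 100
} > "$report" 2>&1

# Print the report path so it shows up wherever the script's output goes.
echo "$report"
```

The script only gathers information and leaves the decision to reboot to 
an admin. A real deployment would probably write somewhere more durable 
than /tmp, since anything there may not survive the reboot.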

All the best,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA


