[slurm-users] Slurm - UnkillableStepProgram

Yap, Mike M.Yap at massey.ac.nz
Tue Mar 23 21:18:46 UTC 2021


Hi Chris

Thanks for the clarification.

Mike

-----Original Message-----
From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Chris Samuel
Sent: Tuesday, 23 March 2021 5:30 PM
To: slurm-users at lists.schedmd.com
Subject: Re: [slurm-users] Slurm - UnkillableStepProgram

Hi Mike,

On 22/3/21 7:12 pm, Yap, Mike wrote:

> # I presume UnkillableStepTimeout is set in slurm.conf and acts as
> a timer to trigger UnkillableStepProgram

That is correct.

> # UnkillableStepProgram can be used to send email or reboot the compute
> node - the question is how do we configure it?

It can also be used to automate collecting debug info (which is what we do). We then manually intervene to reboot the node once we've determined there's no more useful info to collect.

It's just configured in your slurm.conf.

UnkillableStepProgram=/path/to/the/unkillable/step/script.sh

Of course this script has to be present on every compute node.
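
As a rough illustration of the "collect debug info" approach, here is a minimal sketch of what such a script might look like. Everything in it is an assumption rather than the script described above: the log directory, the UNKILLABLE_LOG_DIR override, the collect_debug_info function name and the choice of what to capture are all hypothetical. Recent slurm.conf man pages say the program is run with SLURM_JOB_ID and SLURM_STEP_ID in its environment; check the man page for your version, since the sketch falls back to "unknown" if they are not set.

```shell
#!/bin/bash
# Hypothetical UnkillableStepProgram sketch: save a small debug report
# to local disk so it survives until someone inspects the node.

# Assumed location; override with UNKILLABLE_LOG_DIR (hypothetical name).
LOG_DIR="${UNKILLABLE_LOG_DIR:-/var/tmp/slurm-unkillable}"

# Assumed to be provided by slurmd; fall back to "unknown" if absent.
JOB_ID="${SLURM_JOB_ID:-unknown}"
STEP_ID="${SLURM_STEP_ID:-unknown}"

# Write a report to the file given as $1. Errors from individual
# commands are captured in the report instead of aborting the script.
collect_debug_info() {
    local out="$1"
    {
        echo "=== unkillable step report ==="
        echo "host: $(uname -n)"
        echo "date: $(date -u '+%Y-%m-%dT%H:%M:%SZ')"
        echo "job: ${JOB_ID} step: ${STEP_ID}"
        echo
        echo "=== processes in uninterruptible sleep (state D) ==="
        # Keep the header line plus any process whose STAT starts with D.
        ps -eo pid,ppid,user,stat,wchan:32,args | awk 'NR == 1 || $4 ~ /^D/'
        echo
        echo "=== last kernel messages ==="
        dmesg 2>/dev/null | tail -n 100
    } > "$out" 2>&1
}

mkdir -p "$LOG_DIR" || exit 1
REPORT="${LOG_DIR}/job${JOB_ID}_step${STEP_ID}_$(date +%s).log"
collect_debug_info "$REPORT"
```

The script would need to be executable and installed at the same path on every compute node. The report is written under /var/tmp rather than /tmp on the assumption that it should still be there after a reboot. If you want the program to fire later than the default, UnkillableStepTimeout can be set in slurm.conf alongside it; any particular value is site-specific.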

All the best,
Chris
--
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



