[slurm-users] Kill task failed, state set to DRAINING, UnkillableStepTimeout=120

Paul Edmon pedmon at cfa.harvard.edu
Mon Nov 30 18:01:00 UTC 2020


That can help.  Usually this happens because the storage the job is
using is laggy and takes a long time flushing the job's data.  So
making sure that your storage is up, responsive, and stable will also
cut these down.
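
If you want to see exactly what a step is stuck on when this fires, you
can also point UnkillableStepProgram (which is null in your config) at a
small diagnostic hook on the compute nodes.  Just a rough sketch of the
kind of thing I mean -- the script path is made up, and whether
SLURM_JOB_ID is set in the hook's environment is an assumption, so check
that on your version:

  #!/bin/bash
  # Runs when slurmd gives up killing a step; dump anything stuck in
  # uninterruptible sleep (D state), which almost always means hung I/O.
  LOG=/var/log/slurm/unkillable-$(hostname -s)-$(date +%Y%m%d-%H%M%S).log
  {
    echo "=== unkillable step on $(hostname) at $(date) ==="
    echo "Job: ${SLURM_JOB_ID:-unknown}"   # assumption: set by slurmstepd
    echo "--- D-state processes (blocked on I/O) ---"
    ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'
    echo "--- mount responsiveness ---"
    timeout 10 df -h || echo "df timed out, storage is likely hung"
  } >> "$LOG" 2>&1

and in slurm.conf on the nodes:

  UnkillableStepProgram=/usr/local/sbin/unkillable_report.sh

That won't stop the drain by itself, but it tells you which filesystem
to chase.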

-Paul Edmon-

On 11/30/2020 12:52 PM, Robert Kudyba wrote:
> I've seen that this was a bug that was fixed
> (https://bugs.schedmd.com/show_bug.cgi?id=3941), but it still happens
> occasionally. A user cancels his/her job and a node gets drained.
> UnkillableStepTimeout=120 is set in slurm.conf
>
> Slurm 20.02.3 on CentOS 7.9 running on Bright Cluster 8.2
>
> Slurm Job_id=6908 Name=run.sh Ended, Run time 7-17:50:36, CANCELLED, 
> ExitCode 0
> Resending TERMINATE_JOB request JobId=6908 Nodelist=node001
> update_node: node node001 reason set to: Kill task failed
> update_node: node node001 state set to DRAINING
> error: slurmd error running JobId=6908 on node(s)=node001: Kill task 
> failed
>
> update_node: node node001 reason set to: hung
> update_node: node node001 state set to DOWN
> update_node: node node001 state set to IDLE
> error: Nodes node001 not responding
>
> scontrol show config | grep kill
> UnkillableStepProgram   = (null)
> UnkillableStepTimeout   = 120 sec
>
> Do we just increase the timeout value?
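
For the archives: bumping it is a one-line change in slurm.conf, e.g.

  UnkillableStepTimeout=180

pushed out to the nodes and followed by an scontrol reconfigure (180 is
just an example value, and if the daemons don't pick it up you may need
to restart slurmd).  But if the storage regularly hangs for minutes,
raising the timeout only hides the symptom.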


