[slurm-users] Kill task failed, state set to DRAINING, UnkillableStepTimeout=120

Robert Kudyba rkudyba at fordham.edu
Mon Nov 30 17:52:18 UTC 2020


I've seen that this was reported as a bug and fixed
(https://bugs.schedmd.com/show_bug.cgi?id=3941), but it still happens
occasionally: a user cancels their job and the node gets drained.
UnkillableStepTimeout=120 is set in slurm.conf.

Slurm 20.02.3 on CentOS 7.9, running on Bright Cluster 8.2

Slurm Job_id=6908 Name=run.sh Ended, Run time 7-17:50:36, CANCELLED,
ExitCode 0
Resending TERMINATE_JOB request JobId=6908 Nodelist=node001
update_node: node node001 reason set to: Kill task failed
update_node: node node001 state set to DRAINING
error: slurmd error running JobId=6908 on node(s)=node001: Kill task failed

update_node: node node001 reason set to: hung
update_node: node node001 state set to DOWN
update_node: node node001 state set to IDLE
error: Nodes node001 not responding

scontrol show config | grep kill
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 120 sec
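We've also considered pointing UnkillableStepProgram at a small diagnostic
script so we can see what the stuck processes were doing before the node
drains. A rough sketch of what we have in mind (the script path and log
location are our own choices, and the assumption that Slurm exports
SLURM_JOB_ID to this program should be checked against the slurm.conf(5)
man page for our version):

```shell
#!/bin/sh
# Hypothetical UnkillableStepProgram sketch (not an official example):
# record which processes were unkillable before the node is drained.
LOGDIR=/var/log/slurm
[ -w "$LOGDIR" ] || LOGDIR=/tmp   # fall back if the log dir is absent
LOG="$LOGDIR/unkillable-${SLURM_JOB_ID:-unknown}.log"
{
  date
  echo "unkillable step, job ${SLURM_JOB_ID:-unknown}"
  # Processes in uninterruptible sleep (D state) usually mean hung I/O,
  # e.g. a stale NFS mount -- a common cause of "Kill task failed".
  ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'
} >> "$LOG" 2>&1
```

The D-state check is the interesting part: if the kill really failed
because a process is wedged in the kernel on I/O, no timeout value will
help until the underlying mount or device recovers.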

Do we just increase the timeout value?
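If raising the timeout is the answer, that part at least looks like a
one-line slurm.conf change followed by a reconfigure (the 300-second
value below is purely illustrative, not a recommendation):

```
# slurm.conf -- give slurmd longer to reap processes stuck in I/O
UnkillableStepTimeout=300
# optionally capture diagnostics when it still fires:
#UnkillableStepProgram=/usr/local/sbin/unkillable.sh
```

followed by `scontrol reconfigure` on the controller; my understanding
is that this setting is picked up without restarting the daemons, but I
would appreciate confirmation.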