I've seen where this was a bug that was fixed (https://bugs.schedmd.com/show_bug.cgi?id=3941), but it still happens occasionally: a user cancels his/her job and the node gets drained.
UnkillableStepTimeout=120 is set in slurm.conf
Slurm 20.02.3 on CentOS 7.9 running on Bright Cluster 8.2.

Slurm Job_id=6908 Name=run.sh Ended, Run time 7-17:50:36, CANCELLED, ExitCode 0
Resending TERMINATE_JOB request JobId=6908 Nodelist=node001
update_node: node node001 reason set to: Kill task failed
update_node: node node001 state set to DRAINING
error: slurmd error running JobId=6908 on node(s)=node001: Kill task failed

update_node: node node001 reason set to: hung
update_node: node node001 state set to DOWN
update_node: node node001 state set to IDLE
error: Nodes node001 not responding

scontrol show config | grep kill
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 120 sec

Do we just increase the timeout value?
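If raising the timeout is the way to go, I assume it would just be something like the sketch below in slurm.conf, possibly also pointing UnkillableStepProgram at a small diagnostic script so we can see what the stuck processes are doing. The values, the script path, and its contents are placeholders I made up, not anything from our current setup, and I'm not sure which environment variables slurmd exports to that program, so the job-id line is written defensively:

    # slurm.conf (example values only)
    UnkillableStepTimeout=300
    UnkillableStepProgram=/usr/local/sbin/unkillable_step.sh

    # /usr/local/sbin/unkillable_step.sh (hypothetical diagnostic script)
    #!/bin/bash
    # Runs on the compute node when a step cannot be killed; record which
    # processes are in uninterruptible sleep (D state), which usually means
    # hung I/O (e.g. NFS) rather than a Slurm problem.
    {
      date
      echo "Unkillable step on $(hostname), job ${SLURM_JOB_ID:-unknown}"
      ps -eo pid,ppid,stat,wchan:32,comm | awk '$3 ~ /D/'
    } >> /var/log/slurm/unkillable_steps.log 2>&1

Or is bumping UnkillableStepTimeout alone the usual fix here?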