[slurm-users] Nodes going into drain because of "Kill task failed"

Tue Oct 22 21:22:49 UTC 2019

A common reason for seeing this is a process dumping core -- the kernel will not act on kill requests until the dump is complete, so the job isn't killed as quickly as Slurm would like. I typically recommend increasing UnkillableStepTimeout from 60 seconds to 120 or 180 seconds to avoid this.
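
As a rough illustration of that change: the slurm.conf man page documents this wait as UnkillableStepTimeout, with a default of 60 seconds. The fragment below is only a sketch. The value 120 is one of the figures suggested above, not a tuned recommendation, and the parameter name should be checked against the man page for the installed Slurm version.

```
# slurm.conf (fragment, illustrative value)
# Seconds to wait for a step's processes to exit after SIGKILL
# before they are treated as unkillable. Documented default: 60.
UnkillableStepTimeout=120
```

After the change is distributed to the nodes, the daemons need to pick it up (for example with `scontrol reconfigure`, or a slurmd restart if that turns out not to be enough), and a drained node can be returned to service with something like `scontrol update NodeName=server15 State=RESUME`.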

Andy

-----Original Message-----
From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of Will Dennis
Sent: Tuesday, October 22, 2019 4:59 PM
To: slurm-users at lists.schedmd.com
Subject: [slurm-users] Nodes going into drain because of "Kill task failed"

Hi all,

I have a number of nodes on one of my 17.11.7 clusters in drain mode with the reason "Kill task failed".

I see the following in slurmd.log:

[2019-10-17T20:06:43.027] [34443.0] error: *** STEP 34443.0 ON server15 CANCELLED AT 2019-10-17T20:06:43 DUE TO TIME LIMIT ***
[2019-10-17T20:06:43.029] [34443.0] Sent signal 15 to 34443.0
[2019-10-17T20:06:43.029] Job 34443: timeout: sent SIGTERM to 1 active steps
[2019-10-17T20:06:43.031] [34443.0] Sent signal 18 to 34443.0
[2019-10-17T20:06:43.032] [34443.0] Sent signal 15 to 34443.0
[2019-10-17T20:06:43.036] [34443.0] task 0 (8741) exited. Killed by signal 15.
[2019-10-17T20:06:43.036] [34443.0] Step 34443.0 hit memory limit at least once during execution. This may or may not result in some failure.
[2019-10-17T20:07:13.048] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:15.051] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:16.053] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:17.055] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:18.057] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:19.059] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:20.061] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:21.063] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:22.065] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:23.066] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:24.069] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:34.071] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:44.000] [34443.0] error: *** STEP 34443.0 STEPD TERMINATED ON server15 AT 2019-10-17T20:07:43 DUE TO JOB NOT ENDING WITH SIGNALS ***
[2019-10-17T20:07:44.001] [34443.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
[2019-10-17T20:07:44.004] [34443.0] done with job

From the above, it looks like the step hit its time limit and signal 15 (SIGTERM) was sent to the process. Task 0 is reported as exited at 2019-10-17T20:06:43.036, so the SIGTERM seems to have worked, but the series of SIGKILLs sent afterwards suggests it did not fully succeed?
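
To make the timeline in that log easier to read, here is a minimal Python sketch that measures the gaps between the key events. It works on an abbreviated copy of the lines quoted above; the helper name `first_time` and the variable names are made up for this example and are not part of any Slurm tooling.

```python
import re
from datetime import datetime

# Abbreviated copy of the slurmd.log excerpt quoted in this message.
LOG = """\
[2019-10-17T20:06:43.027] [34443.0] error: *** STEP 34443.0 ON server15 CANCELLED AT 2019-10-17T20:06:43 DUE TO TIME LIMIT ***
[2019-10-17T20:06:43.036] [34443.0] task 0 (8741) exited. Killed by signal 15.
[2019-10-17T20:07:13.048] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:34.071] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:44.000] [34443.0] error: *** STEP 34443.0 STEPD TERMINATED ON server15 AT 2019-10-17T20:07:43 DUE TO JOB NOT ENDING WITH SIGNALS ***
"""

# Leading "[YYYY-MM-DDTHH:MM:SS.mmm]" timestamp on each log line.
STAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})\]")


def first_time(log, needle):
    """Return the timestamp of the first log line containing `needle`."""
    for line in log.splitlines():
        match = STAMP.match(line)
        if match and needle in line:
            return datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")
    raise ValueError(f"no line containing {needle!r}")


cancelled = first_time(LOG, "CANCELLED AT")
task_exit = first_time(LOG, "exited. Killed by signal")
first_kill = first_time(LOG, "Sent SIGKILL")
gave_up = first_time(LOG, "STEPD TERMINATED")

# Seconds elapsed since the step was cancelled for hitting its time limit.
exit_gap = (task_exit - cancelled).total_seconds()
kill_gap = (first_kill - cancelled).total_seconds()
giveup_gap = (gave_up - cancelled).total_seconds()

print(f"task 0 exited:       {exit_gap:.3f}s after cancel")
print(f"first SIGKILL:       {kill_gap:.3f}s after cancel")
print(f"stepd gave up:       {giveup_gap:.3f}s after cancel")
```

On these lines the gaps come out as 0.009 s, 30.021 s and 60.973 s. The 30-second gap before the first SIGKILL would be consistent with a KillWait of 30 seconds (the documented default), though the cluster's actual setting is not shown here. The SIGKILLs are addressed to the whole step 34443.0 rather than to task 0 alone, so the repeated attempts suggest that something in the step was still alive after task 0 exited.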

What might be causing this, and how can I prevent it from happening?

Thanks,
Will