[slurm-users] Nodes going into drain because of "Kill task failed"

Paul Edmon pedmon at cfa.harvard.edu
Wed Oct 23 00:49:06 UTC 2019


It can also happen if you have a stalled-out filesystem or stuck 
processes.  I've gotten in the habit of doing a daily patrol for them to 
clean them up.  Most of the time you can just reopen the node, but 
sometimes this indicates something is wedged.
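
For reference, a minimal sketch of such a patrol using standard Slurm 
tooling (the node name is just an example):

  # list drained/down nodes along with the reason slurmctld recorded
  sinfo -R

  # if the node checks out, return it to service
  scontrol update NodeName=server15 State=RESUME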

-Paul Edmon-

On 10/22/2019 5:22 PM, Riebs, Andy wrote:
>   A common reason for seeing this is a process dropping core -- the kernel will ignore kill requests until the core dump is complete, so the job isn't killed as quickly as Slurm would like. I typically recommend increasing UnkillableStepTimeout from the default 60 seconds to 120 or 180 seconds to avoid this.
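>
> As a rough sketch, that is a one-line change in slurm.conf (180 is just the example value from above; the default is 60):
>
>   # allow more time for job step tasks to exit before they are declared unkillable
>   UnkillableStepTimeout=180
>
> The change then has to reach the compute nodes; restarting slurmd (or an "scontrol reconfigure", depending on your setup) is the usual way.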
>
> Andy
>
> -----Original Message-----
> From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of Will Dennis
> Sent: Tuesday, October 22, 2019 4:59 PM
> To: slurm-users at lists.schedmd.com
> Subject: [slurm-users] Nodes going into drain because of "Kill task failed"
>
> Hi all,
>
> I have a number of nodes on one of my 17.11.7 clusters in drain mode on account of reason: "Kill task failed"
>
> I see the following in slurmd.log —
>
> [2019-10-17T20:06:43.027] [34443.0] error: *** STEP 34443.0 ON server15 CANCELLED AT 2019-10-17T20:06:43 DUE TO TIME LIMIT ***
> [2019-10-17T20:06:43.029] [34443.0] Sent signal 15 to 34443.0
> [2019-10-17T20:06:43.029] Job 34443: timeout: sent SIGTERM to 1 active steps
> [2019-10-17T20:06:43.031] [34443.0] Sent signal 18 to 34443.0
> [2019-10-17T20:06:43.032] [34443.0] Sent signal 15 to 34443.0
> [2019-10-17T20:06:43.036] [34443.0] task 0 (8741) exited. Killed by signal 15.
> [2019-10-17T20:06:43.036] [34443.0] Step 34443.0 hit memory limit at least once during execution. This may or may not result in some failure.
> [2019-10-17T20:07:13.048] [34443.0] Sent SIGKILL signal to 34443.0
> [2019-10-17T20:07:15.051] [34443.0] Sent SIGKILL signal to 34443.0
> [2019-10-17T20:07:16.053] [34443.0] Sent SIGKILL signal to 34443.0
> [2019-10-17T20:07:17.055] [34443.0] Sent SIGKILL signal to 34443.0
> [2019-10-17T20:07:18.057] [34443.0] Sent SIGKILL signal to 34443.0
> [2019-10-17T20:07:19.059] [34443.0] Sent SIGKILL signal to 34443.0
> [2019-10-17T20:07:20.061] [34443.0] Sent SIGKILL signal to 34443.0
> [2019-10-17T20:07:21.063] [34443.0] Sent SIGKILL signal to 34443.0
> [2019-10-17T20:07:22.065] [34443.0] Sent SIGKILL signal to 34443.0
> [2019-10-17T20:07:23.066] [34443.0] Sent SIGKILL signal to 34443.0
> [2019-10-17T20:07:24.069] [34443.0] Sent SIGKILL signal to 34443.0
> [2019-10-17T20:07:34.071] [34443.0] Sent SIGKILL signal to 34443.0
> [2019-10-17T20:07:44.000] [34443.0] error: *** STEP 34443.0 STEPD TERMINATED ON server15 AT 2019-10-17T20:07:43 DUE TO JOB NOT ENDING WITH SIGNALS ***
> [2019-10-17T20:07:44.001] [34443.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
> [2019-10-17T20:07:44.004] [34443.0] done with job
>
> From the above, it seems the step's time limit was reached and signal 15 (SIGTERM) was sent to the process, which appears to have succeeded at 2019-10-17T20:06:43.036, but judging from the series of SIGKILLs sent afterwards, I guess it did not?
>
> What might be causing this, and how can I prevent it from happening?
>
> Thanks,
> Will


