[slurm-users] Nodes going into drain because of "Kill task failed"

Will Dennis wdennis at nec-labs.com
Tue Oct 22 20:58:42 UTC 2019


Hi all,

I have a number of nodes on one of my 17.11.7 clusters in the drain state, with the reason given as "Kill task failed".

I see the following in slurmd.log:

[2019-10-17T20:06:43.027] [34443.0] error: *** STEP 34443.0 ON server15 CANCELLED AT 2019-10-17T20:06:43 DUE TO TIME LIMIT ***
[2019-10-17T20:06:43.029] [34443.0] Sent signal 15 to 34443.0
[2019-10-17T20:06:43.029] Job 34443: timeout: sent SIGTERM to 1 active steps
[2019-10-17T20:06:43.031] [34443.0] Sent signal 18 to 34443.0
[2019-10-17T20:06:43.032] [34443.0] Sent signal 15 to 34443.0
[2019-10-17T20:06:43.036] [34443.0] task 0 (8741) exited. Killed by signal 15.
[2019-10-17T20:06:43.036] [34443.0] Step 34443.0 hit memory limit at least once during execution. This may or may not result in some failure.
[2019-10-17T20:07:13.048] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:15.051] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:16.053] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:17.055] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:18.057] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:19.059] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:20.061] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:21.063] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:22.065] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:23.066] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:24.069] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:34.071] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:44.000] [34443.0] error: *** STEP 34443.0 STEPD TERMINATED ON server15 AT 2019-10-17T20:07:43 DUE TO JOB NOT ENDING WITH SIGNALS ***
[2019-10-17T20:07:44.001] [34443.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
[2019-10-17T20:07:44.004] [34443.0] done with job

From the above, it looks like the step hit its time limit and signal 15 (SIGTERM) was sent to it. That seems to have succeeded at 2019-10-17T20:06:43.036, when task 0 exited, but judging by the series of SIGKILLs sent afterwards, I guess it did not?
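
To make the timeline in the log above easier to read, here is a minimal Python sketch that parses a few of the quoted slurmd.log lines and measures the gaps between the first SIGTERM, the first SIGKILL, and the "STEPD TERMINATED" message. The function names (parse, first_match, summarize) are made up for illustration and the log excerpt is abbreviated. The gaps of roughly 30 and 60 seconds may correspond to slurm.conf timeouts such as KillWait and UnkillableStepTimeout, but that mapping is an assumption and not something the log itself states.

```python
import re
from datetime import datetime

# Abbreviated copy of the slurmd.log lines quoted in this post.
LOG = """\
[2019-10-17T20:06:43.029] [34443.0] Sent signal 15 to 34443.0
[2019-10-17T20:06:43.036] [34443.0] task 0 (8741) exited. Killed by signal 15.
[2019-10-17T20:07:13.048] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:34.071] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:44.000] [34443.0] error: *** STEP 34443.0 STEPD TERMINATED ON server15 AT 2019-10-17T20:07:43 DUE TO JOB NOT ENDING WITH SIGNALS ***
"""

# Each line starts with a bracketed timestamp followed by the message.
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+(?P<msg>.*)$")


def parse(log_text):
    """Return a list of (timestamp, message) tuples, one per log line."""
    events = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")
            events.append((ts, m.group("msg")))
    return events


def first_match(events, needle):
    """Timestamp of the first message containing needle, or None."""
    for ts, msg in events:
        if needle in msg:
            return ts
    return None


def summarize(log_text):
    """Measure the gaps between SIGTERM, first SIGKILL, and stepd giving up."""
    events = parse(log_text)
    sigterm = first_match(events, "Sent signal 15")
    sigkill = first_match(events, "Sent SIGKILL")
    gave_up = first_match(events, "STEPD TERMINATED")
    return {
        "sigterm_to_first_sigkill_s": (sigkill - sigterm).total_seconds(),
        "sigterm_to_stepd_terminated_s": (gave_up - sigterm).total_seconds(),
        "sigkill_count": sum("Sent SIGKILL" in msg for _, msg in events),
    }


if __name__ == "__main__":
    for key, value in summarize(LOG).items():
        print(f"{key}: {value}")
```

Running the same function over the full log would report more SIGKILL attempts; the two time gaps stay the same because they depend only on the first SIGTERM, the first SIGKILL, and the final error line.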

What might be causing this, and how can I prevent it from happening?

Thanks,
Will

