[slurm-users] Nodes going into drain because of "Kill task failed"

Riebs, Andy andy.riebs at hpe.com
Wed Oct 23 11:21:09 UTC 2019


Excellent points raised here!  Two other things to do when you see "kill task failed":

1. Check "dmesg -T" on the suspect node for significant system events (file system problems, communication problems, etc.) around the time the problem was logged
2. Check /var/log/slurm (or whatever is appropriate on your system) for core files whose timestamps correspond to the time reported for "kill task failed" (example commands for both checks are sketched below)
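
For example (paths and log locations are typical defaults and may differ on your system):

    # kernel events around the time of the failure (OOM kills, NFS/Lustre trouble, hung tasks)
    dmesg -T | grep -iE 'oom|nfs|lustre|hung|error' | less

    # any core files written around the time reported for "kill task failed"?
    ls -lt /var/log/slurm/ | head -n 20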

Andy

-----Original Message-----
From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of Marcus Boden
Sent: Wednesday, October 23, 2019 2:34 AM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Nodes going into drain because of "Kill task failed"

You can also use the UnkillableStepProgram to debug things:

> UnkillableStepProgram
>     If the processes in a job step are determined to be unkillable for a period of time specified by the UnkillableStepTimeout variable, the program specified by UnkillableStepProgram will be executed. This program can be used to take special actions to clean up the unkillable processes and/or notify computer administrators. The program will be run as SlurmdUser (usually "root") on the compute node. By default no program is run.
> UnkillableStepTimeout
>     The length of time, in seconds, that Slurm will wait before deciding that processes in a job step are unkillable (after they have been signaled with SIGKILL) and execute UnkillableStepProgram as described above. The default timeout value is 60 seconds. If exceeded, the compute node will be drained to prevent future jobs from being scheduled on the node.
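
In slurm.conf this could be wired up roughly as follows (the script path is just a placeholder; 120 seconds follows the suggestion further down in the thread):

    UnkillableStepProgram=/usr/local/sbin/unkillable_debug.sh
    UnkillableStepTimeout=120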

This lets you find out what is causing the problem at the moment it
occurs. You could, for example, use lsof to see whether any files are
held open by a hanging file system and mail the output to yourself.
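
A minimal sketch of such a script, assuming a working mail command on the node (the output path and mail address are placeholders; note that lsof can itself hang on a dead mount, hence the timeout):

    #!/bin/bash
    # Hypothetical UnkillableStepProgram: snapshot the node state when Slurm
    # declares a step unkillable, then mail it to an admin.  Adjust the
    # output path and mail address for your site.
    OUT=/tmp/unkillable-$(hostname -s)-$(date +%Y%m%dT%H%M%S).log
    {
        echo "=== unkillable step reported on $(hostname) at $(date) ==="
        echo "--- processes in uninterruptible sleep (D state) ---"
        ps -eo pid,stat,wchan:32,user,cmd | awk 'NR==1 || $2 ~ /D/'
        echo "--- open files (capped; lsof may hang on a dead mount) ---"
        timeout 30 lsof -w 2>/dev/null | head -n 200
        echo "--- recent kernel messages ---"
        dmesg -T | tail -n 50
    } > "$OUT" 2>&1
    mail -s "unkillable step on $(hostname -s)" admin@example.com < "$OUT"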

Best,
Marcus


On 19-10-22 20:49, Paul Edmon wrote:
> It can also happen if you have a stalled out filesystem or stuck processes. 
> I've gotten in the habit of doing a daily patrol for them to clean them up. 
> Most of the time you can just reopen the node, but sometimes this indicates
> something is wedged.
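> 
> For example, once the node looks healthy again it can be returned to service with something like (the node name is a placeholder):
> 
>     scontrol update nodename=node001 state=resume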
> 
> -Paul Edmon-
> 
> On 10/22/2019 5:22 PM, Riebs, Andy wrote:
> >   A common reason for seeing this is if a process is dropping core -- the kernel will ignore job kill requests until that is complete, so the job isn't being killed as quickly as Slurm would like. I typically recommend increasing UnkillableStepTimeout from 60 seconds to 120 or 180 seconds to avoid this.
> > 
> > Andy
> > 
> > -----Original Message-----
> > From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of Will Dennis
> > Sent: Tuesday, October 22, 2019 4:59 PM
> > To: slurm-users at lists.schedmd.com
> > Subject: [slurm-users] Nodes going into drain because of "Kill task failed"
> > 
> > Hi all,
> > 
> > I have a number of nodes on one of my 17.11.7 clusters in drain mode on account of reason: "Kill task failed"
> > 
> > I see the following in slurmd.log —
> > 
> > [2019-10-17T20:06:43.027] [34443.0] error: *** STEP 34443.0 ON server15 CANCELLED AT 2019-10-17T20:06:43 DUE TO TIME LIMIT ***
> > [2019-10-17T20:06:43.029] [34443.0] Sent signal 15 to 34443.0
> > [2019-10-17T20:06:43.029] Job 34443: timeout: sent SIGTERM to 1 active steps
> > [2019-10-17T20:06:43.031] [34443.0] Sent signal 18 to 34443.0
> > [2019-10-17T20:06:43.032] [34443.0] Sent signal 15 to 34443.0
> > [2019-10-17T20:06:43.036] [34443.0] task 0 (8741) exited. Killed by signal 15.
> > [2019-10-17T20:06:43.036] [34443.0] Step 34443.0 hit memory limit at least once during execution. This may or may not result in some failure.
> > [2019-10-17T20:07:13.048] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:15.051] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:16.053] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:17.055] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:18.057] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:19.059] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:20.061] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:21.063] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:22.065] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:23.066] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:24.069] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:34.071] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:44.000] [34443.0] error: *** STEP 34443.0 STEPD TERMINATED ON server15 AT 2019-10-17T20:07:43 DUE TO JOB NOT ENDING WITH SIGNALS ***
> > [2019-10-17T20:07:44.001] [34443.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
> > [2019-10-17T20:07:44.004] [34443.0] done with job
> > 
> > From the above, it seems the step time limit was reached and signal 15 (SIGTERM) was sent to the process, which appears to have succeeded at 2019-10-17T20:06:43.036 -- but judging by the series of SIGKILLs sent afterwards, I guess it did not?
> > 
> > What may be the cause of this, and how to prevent this from happening?
> > 
> > Thanks,
> > Will
> 

-- 
Marcus Vincent Boden, M.Sc.
Arbeitsgruppe eScience
Tel.:   +49 (0)551 201-2191
E-Mail: mboden at gwdg.de
---------------------------------------
Gesellschaft fuer wissenschaftliche
Datenverarbeitung mbH Goettingen (GWDG)
Am Fassberg 11, 37077 Goettingen
URL:    http://www.gwdg.de
E-Mail: gwdg at gwdg.de
Tel.:   +49 (0)551 201-1510
Fax:    +49 (0)551 201-2150
Managing Director: Prof. Dr. Ramin Yahyapour
Chairman of the Supervisory Board:
Prof. Dr. Christian Griesinger
Registered office: Goettingen
Register court: Goettingen
Commercial register no. B 598
---------------------------------------

