[slurm-users] Nodes going into drain because of "Kill task failed"
Paul Edmon
pedmon at cfa.harvard.edu
Thu Jul 23 13:18:37 UTC 2020
Same here. Whenever we see a rash of "Kill task failed" errors it is invariably
symptomatic of one of our Lustre filesystems acting up or being saturated.
-Paul Edmon-
On 7/22/2020 3:21 PM, Ryan Cox wrote:
> Angelos,
>
> I'm glad you mentioned UnkillableStepProgram. We meant to look at
> that a while ago but forgot about it. That will be very useful for us
> as well, though the answer for us is pretty much always Lustre problems.
>
> Ryan
>
> On 7/22/20 1:02 PM, Angelos Ching wrote:
>> Agreed. You may also want to write a script that gathers the list of
>> programs in "D state" (uninterruptible kernel wait) and prints their
>> stacks, and configure it as UnkillableStepProgram so that you can
>> capture the programs and the relevant system calls that caused the job
>> to become unkillable / time out on exit, for further troubleshooting.
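>>
>> A minimal sketch of such a script, written here in Python (the log
>> location is a placeholder to adapt; it relies only on the standard
>> /proc interfaces and must run as root to read kernel stacks):
>>
>>   #!/usr/bin/env python3
>>   # Dump every process currently in D (uninterruptible sleep) state,
>>   # together with its kernel stack, so the blocking syscall can be
>>   # identified after the fact.
>>   import os, socket, time
>>
>>   outfile = "/var/log/slurm/unkillable-%s-%d.log" % (
>>       socket.gethostname(), int(time.time()))
>>   with open(outfile, "a") as out:
>>       for pid in filter(str.isdigit, os.listdir("/proc")):
>>           try:
>>               with open("/proc/%s/stat" % pid) as f:
>>                   # state is the first field after the ")" closing comm
>>                   state = f.read().rsplit(")", 1)[1].split()[0]
>>               if state != "D":
>>                   continue
>>               with open("/proc/%s/comm" % pid) as f:
>>                   comm = f.read().strip()
>>               out.write("PID %s (%s) in D state\n" % (pid, comm))
>>               with open("/proc/%s/stack" % pid) as f:   # root only
>>                   out.write(f.read() + "\n")
>>           except (OSError, IndexError):
>>               continue  # process exited between listing and reading
>>
>> Point UnkillableStepProgram in slurm.conf at the script and slurmstepd
>> will run it when a step cannot be killed within UnkillableStepTimeout.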
>>
>> Regards,
>> Angelos
>> (Sent from mobile, please pardon me for typos and cursoriness.)
>>
>>> On 2020/07/23 0:41, Ryan Cox <ryan_cox at byu.edu> wrote:
>>>
>>> Ivan,
>>>
>>> Are you having I/O slowness? That is the most common cause for us.
>>> If it's not that, you'll want to look through all the reasons that
>>> it takes a long time for a process to actually die after a SIGKILL
>>> because one of those is the likely cause. Typically it's because the
>>> process is waiting for an I/O syscall to return. Sometimes swap
>>> death is the culprit, but usually not at the scale that you stated.
>>> Maybe you could try reproducing the issue manually or putting
>>> something in the epilog to see the state of the processes in the
>>> job's cgroup.
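>>>
>>> A rough sketch of that epilog idea (in Python; the cgroup path below
>>> assumes cgroup v1 with proctrack/cgroup and the usual
>>> /sys/fs/cgroup/freezer/slurm layout, so adjust it to your setup):
>>>
>>>   #!/usr/bin/env python3
>>>   # Epilog sketch: record any processes still left in the job's
>>>   # cgroup when the job ends, along with their process states.
>>>   import os, sys
>>>
>>>   job = os.environ.get("SLURM_JOB_ID", "")
>>>   uid = os.environ.get("SLURM_JOB_UID", "")
>>>   cg = ("/sys/fs/cgroup/freezer/slurm/uid_%s/job_%s/cgroup.procs"
>>>         % (uid, job))
>>>   try:
>>>       pids = open(cg).read().split()
>>>   except OSError:
>>>       sys.exit(0)   # cgroup already cleaned up: nothing left behind
>>>   with open("/var/log/slurm/epilog-leftover-%s.log" % job, "a") as out:
>>>       for pid in pids:
>>>           try:
>>>               stat = open("/proc/%s/stat" % pid).read()
>>>               state = stat.rsplit(")", 1)[1].split()[0]
>>>               out.write("job %s: pid %s still alive, state %s\n"
>>>                         % (job, pid, state))
>>>           except OSError:
>>>               continue   # process exited while we were looking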
>>>
>>> Ryan
>>>
>>> On 7/22/20 10:24 AM, Ivan Kovanda wrote:
>>>>
>>>> Dear slurm community,
>>>>
>>>> Currently running slurm version 18.08.4
>>>>
>>>> We have been experiencing an issue where any node that a Slurm job
>>>> was submitted to ends up in "drain" state.
>>>>
>>>> From what I've seen, it appears that there is a problem with how
>>>> Slurm is cleaning up the job processes with SIGKILL.
>>>>
>>>> I've found this Slurm article
>>>> (https://slurm.schedmd.com/troubleshoot.html#completing), which
>>>> has a section titled "Jobs and nodes are stuck in COMPLETING
>>>> state" that recommends increasing "UnkillableStepTimeout" in
>>>> slurm.conf, but all that has done is prolong the time it takes
>>>> for the job to time out.
>>>>
>>>> The default time for the "UnkillableStepTimeout" is 60 seconds.
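>>>>
>>>> For reference, this is the sort of slurm.conf entry involved (the
>>>> value shown is only an example):
>>>>
>>>>   # slurm.conf -- illustrative value; propagate the file to all
>>>>   # nodes and restart/reconfigure the daemons after changing it
>>>>   UnkillableStepTimeout=180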
>>>>
>>>> After the job completes, it stays in the CG (completing) status for
>>>> the 60 seconds, then the nodes the job was submitted to go to drain
>>>> status.
>>>>
>>>> On the headnode running slurmctld, I am seeing this in the log -
>>>> /var/log/slurmctld:
>>>>
>>>> --------------------------------------------------------------------------------------------------------------------------------------------
>>>>
>>>> [2020-07-21T22:40:03.000] update_node: node node001 reason set to:
>>>> Kill task failed
>>>>
>>>> [2020-07-21T22:40:03.001] update_node: node node001 state set to
>>>> DRAINING
>>>>
>>>> On the compute node, I am seeing this in the log - /var/log/slurmd
>>>>
>>>> --------------------------------------------------------------------------------------------------------------------------------------------
>>>>
>>>> [2020-07-21T22:38:33.110] [1485.batch] done with job
>>>>
>>>> [2020-07-21T22:38:33.110] [1485.extern] Sent signal 18 to
>>>> 1485.4294967295
>>>>
>>>> [2020-07-21T22:38:33.111] [1485.extern] Sent signal 15 to
>>>> 1485.4294967295
>>>>
>>>> [2020-07-21T22:39:02.820] [1485.extern] Sent SIGKILL signal to
>>>> 1485.4294967295
>>>>
>>>> [2020-07-21T22:40:03.000] [1485.extern] error: *** EXTERN STEP FOR
>>>> 1485 STEPD TERMINATED ON node001 AT 2020-07-21T22:40:02 DUE TO JOB
>>>> NOT ENDING WITH SIGNALS ***
>>>>
>>>> I've tried restarting the slurmd daemon on the compute nodes, and
>>>> even completely rebooting a few compute nodes (node001, node002).
>>>>
>>>> From what I've seen, we're experiencing this on all nodes in the
>>>> cluster.
>>>>
>>>> I've yet to restart the headnode because there are still active
>>>> jobs on the system so I don't want to interrupt those.
>>>>
>>>> Thank you for your time,
>>>>
>>>> Ivan
>>>>
>