[slurm-users] Nodes going into drain because of "Kill task failed"
Paul Edmon
pedmon at cfa.harvard.edu
Thu Jul 23 13:18:37 UTC 2020
Same here. Whenever we see a rash of "Kill task failed" errors it is invariably
symptomatic of one of our Lustre filesystems acting up or being saturated.
-Paul Edmon-
On 7/22/2020 3:21 PM, Ryan Cox wrote:
> Angelos,
>
> I'm glad you mentioned UnkillableStepProgram. We meant to look at
> that a while ago but forgot about it. That will be very useful for us
> as well, though the answer for us is pretty much always Lustre problems.
>
> Ryan
>
> On 7/22/20 1:02 PM, Angelos Ching wrote:
>> Agreed. You may also want to write a script that gathers the list of
>> programs in "D state" (uninterruptible kernel wait) and prints their
>> stacks, and configure it as UnkillableStepProgram so that you can
>> capture the programs and the relevant system calls that caused the job
>> to become unkillable / time out on exit, for further troubleshooting.
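>>
>> A minimal sketch of such a script, written here in Python (the log
>> location is a placeholder to adapt; it relies only on the standard
>> /proc interfaces and must run as root to read kernel stacks):
>>
>>   #!/usr/bin/env python3
>>   # Dump every process currently in D (uninterruptible sleep) state,
>>   # together with its kernel stack, so the blocking syscall can be
>>   # identified after the fact.
>>   import os, socket, time
>>
>>   outfile = "/var/log/slurm/unkillable-%s-%d.log" % (
>>       socket.gethostname(), int(time.time()))
>>   with open(outfile, "a") as out:
>>       for pid in filter(str.isdigit, os.listdir("/proc")):
>>           try:
>>               with open("/proc/%s/stat" % pid) as f:
>>                   # state is the first field after the ")" closing comm
>>                   state = f.read().rsplit(")", 1)[1].split()[0]
>>               if state != "D":
>>                   continue
>>               with open("/proc/%s/comm" % pid) as f:
>>                   comm = f.read().strip()
>>               out.write("PID %s (%s) in D state\n" % (pid, comm))
>>               with open("/proc/%s/stack" % pid) as f:   # root only
>>                   out.write(f.read() + "\n")
>>           except (OSError, IndexError):
>>               continue  # process exited between listing and reading
>>
>> Point UnkillableStepProgram in slurm.conf at the script and slurmstepd
>> will run it when a step cannot be killed within UnkillableStepTimeout.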
>>
>> Regards,
>> Angelos
>> (Sent from mobile, please pardon me for typos and cursoriness.)
>>
>>> On 2020/07/23 0:41, Ryan Cox <ryan_cox at byu.edu> wrote:
>>>
>>> Ivan,
>>>
>>> Are you having I/O slowness? That is the most common cause for us.
>>> If it's not that, you'll want to look through all the reasons that
>>> it takes a long time for a process to actually die after a SIGKILL
>>> because one of those is the likely cause. Typically it's because the
>>> process is waiting for an I/O syscall to return. Sometimes swap
>>> death is the culprit, but usually not at the scale that you stated.
>>> Maybe you could try reproducing the issue manually or putting
>>> something in the epilog to see the state of the processes in the
>>> job's cgroup.
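>>>
>>> A rough sketch of that epilog idea (in Python; the cgroup path below
>>> assumes cgroup v1 with proctrack/cgroup and the usual
>>> /sys/fs/cgroup/freezer/slurm layout, so adjust it to your setup):
>>>
>>>   #!/usr/bin/env python3
>>>   # Epilog sketch: record any processes still left in the job's
>>>   # cgroup when the job ends, along with their process states.
>>>   import os, sys
>>>
>>>   job = os.environ.get("SLURM_JOB_ID", "")
>>>   uid = os.environ.get("SLURM_JOB_UID", "")
>>>   cg = ("/sys/fs/cgroup/freezer/slurm/uid_%s/job_%s/cgroup.procs"
>>>         % (uid, job))
>>>   try:
>>>       pids = open(cg).read().split()
>>>   except OSError:
>>>       sys.exit(0)   # cgroup already cleaned up: nothing left behind
>>>   with open("/var/log/slurm/epilog-leftover-%s.log" % job, "a") as out:
>>>       for pid in pids:
>>>           try:
>>>               stat = open("/proc/%s/stat" % pid).read()
>>>               state = stat.rsplit(")", 1)[1].split()[0]
>>>               out.write("job %s: pid %s still alive, state %s\n"
>>>                         % (job, pid, state))
>>>           except OSError:
>>>               continue   # process exited while we were looking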
>>>
>>> Ryan
>>>
>>> On 7/22/20 10:24 AM, Ivan Kovanda wrote:
>>>>
>>>> Dear slurm community,
>>>>
>>>> Currently running slurm version 18.08.4
>>>>
>>>> We have been experiencing an issue where any node that a Slurm job
>>>> was submitted to ends up in "drain" state.
>>>>
>>>> From what I've seen, it appears that there is a problem with how
>>>> Slurm is cleaning up the job processes with SIGKILL.
>>>>
>>>> I've found this Slurm article
>>>> (https://slurm.schedmd.com/troubleshoot.html#completing), which
>>>> has a section titled "Jobs and nodes are stuck in COMPLETING
>>>> state" that recommends increasing "UnkillableStepTimeout" in
>>>> slurm.conf, but all that has done is prolong the time it takes
>>>> for the job to time out.
>>>>
>>>> The default time for the "UnkillableStepTimeout" is 60 seconds.
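>>>>
>>>> For reference, this is the sort of slurm.conf entry involved (the
>>>> value shown is only an example):
>>>>
>>>>   # slurm.conf -- illustrative value; propagate the file to all
>>>>   # nodes and restart/reconfigure the daemons after changing it
>>>>   UnkillableStepTimeout=180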
>>>>
>>>> After the job completes, it stays in the CG (completing) status for
>>>> the 60 seconds, then the nodes the job was submitted to go to drain
>>>> status.
>>>>
>>>> On the headnode running slurmctld, I am seeing this in the log -
>>>> /var/log/slurmctld:
>>>>
>>>> --------------------------------------------------------------------------------------------------------------------------------------------
>>>>
>>>> [2020-07-21T22:40:03.000] update_node: node node001 reason set to:
>>>> Kill task failed
>>>>
>>>> [2020-07-21T22:40:03.001] update_node: node node001 state set to
>>>> DRAINING
>>>>
>>>> On the compute node, I am seeing this in the log - /var/log/slurmd
>>>>
>>>> --------------------------------------------------------------------------------------------------------------------------------------------
>>>>
>>>> [2020-07-21T22:38:33.110] [1485.batch] done with job
>>>>
>>>> [2020-07-21T22:38:33.110] [1485.extern] Sent signal 18 to
>>>> 1485.4294967295
>>>>
>>>> [2020-07-21T22:38:33.111] [1485.extern] Sent signal 15 to
>>>> 1485.4294967295
>>>>
>>>> [2020-07-21T22:39:02.820] [1485.extern] Sent SIGKILL signal to
>>>> 1485.4294967295
>>>>
>>>> [2020-07-21T22:40:03.000] [1485.extern] error: *** EXTERN STEP FOR
>>>> 1485 STEPD TERMINATED ON node001 AT 2020-07-21T22:40:02 DUE TO JOB
>>>> NOT ENDING WITH SIGNALS ***
>>>>
>>>> I've tried restarting the slurmd daemon on the compute nodes, and
>>>> even completely rebooting a few compute nodes (node001, node002).
>>>>
>>>> From what I've seen, we're experiencing this on all nodes in the
>>>> cluster.
>>>>
>>>> I've yet to restart the headnode because there are still active
>>>> jobs on the system so I don't want to interrupt those.
>>>>
>>>> Thank you for your time,
>>>>
>>>> Ivan
>>>>
>