[slurm-users] Nodes going into drain because of "Kill task failed"
Ryan Cox
ryan_cox at byu.edu
Wed Jul 22 19:21:13 UTC 2020
Angelos,
I'm glad you mentioned UnkillableStepProgram. We meant to look at that
a while ago but forgot about it. That will be very useful for us as
well, though the answer for us is pretty much always Lustre problems.
Ryan
On 7/22/20 1:02 PM, Angelos Ching wrote:
> Agreed. You may also want to write a script that gathers the list of
> programs in "D state" (kernel wait) and prints their stacks, and
> configure it as UnkillableStepProgram so that you can capture the
> programs and the relevant system calls that caused the job to become
> unkillable / time out while exiting, for further troubleshooting.
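> Just as a sketch (untested; the approach and any paths are only
> examples), such a script can walk /proc, pick out D-state processes,
> and dump their kernel stacks, e.g. in Python:
>
> #!/usr/bin/env python3
> # Sketch of an UnkillableStepProgram helper: list processes stuck in
> # D state (uninterruptible kernel wait) and print their kernel stacks.
> # Assumes it runs as root so /proc/<pid>/stack is readable.
> import os
>
> def read(path):
>     try:
>         with open(path) as f:
>             return f.read()
>     except OSError:
>         return ""
>
> for pid in filter(str.isdigit, os.listdir("/proc")):
>     stat = read(f"/proc/{pid}/stat")
>     if not stat:
>         continue
>     # The state letter follows the ')' that closes the comm field.
>     state = stat.rpartition(")")[2].split()[0]
>     if state == "D":
>         comm = read(f"/proc/{pid}/comm").strip()
>         print(f"PID {pid} ({comm}) in D state, kernel stack:")
>         print(read(f"/proc/{pid}/stack"))
>
> Then set UnkillableStepProgram in slurm.conf to wherever you install
> it.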
>
> Regards,
> Angelos
> (Sent from mobile, please pardon me for typos and cursoriness.)
>
>> On 2020/07/23 0:41, Ryan Cox <ryan_cox at byu.edu> wrote:
>>
>> Ivan,
>>
>> Are you having I/O slowness? That is the most common cause for us. If
>> it's not that, you'll want to look through all the reasons that it
>> takes a long time for a process to actually die after a SIGKILL
>> because one of those is the likely cause. Typically it's because the
>> process is waiting for an I/O syscall to return. Sometimes swap death
>> is the culprit, but usually not at the scale that you stated. Maybe
>> you could try reproducing the issue manually or putting something in
>> the epilog to see the state of the processes in the job's cgroup.
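>> For example (only a sketch, assuming cgroup v1 with Slurm's usual
>> freezer hierarchy and that the epilog environment provides
>> SLURM_JOB_ID and SLURM_JOB_UID), an epilog helper could walk the
>> job's cgroup and report the state of anything still in it:
>>
>> #!/usr/bin/env python3
>> # Sketch of an epilog helper: print the state of every process that
>> # is still in the job's cgroup. The cgroup path is an assumption
>> # based on a cgroup v1 freezer hierarchy; adjust it to your setup.
>> import os
>>
>> job = os.environ.get("SLURM_JOB_ID", "")
>> uid = os.environ.get("SLURM_JOB_UID", "")
>> cg = f"/sys/fs/cgroup/freezer/slurm/uid_{uid}/job_{job}"
>>
>> for root, dirs, files in os.walk(cg):
>>     if "cgroup.procs" not in files:
>>         continue
>>     with open(os.path.join(root, "cgroup.procs")) as f:
>>         pids = f.read().split()
>>     for pid in pids:
>>         try:
>>             with open(f"/proc/{pid}/stat") as f:
>>                 stat = f.read()
>>         except OSError:
>>             continue  # process exited in the meantime
>>         # The state letter follows the ')' closing the comm field.
>>         state = stat.rpartition(")")[2].split()[0]
>>         print(f"job {job}: PID {pid} in {root} has state {state}")
>>
>> Logging that output somewhere persistent makes it easier to line up
>> with the "Kill task failed" events afterwards.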
>>
>> Ryan
>>
>> On 7/22/20 10:24 AM, Ivan Kovanda wrote:
>>>
>>> Dear slurm community,
>>>
>>> Currently running slurm version 18.08.4
>>>
>>> We have been experiencing an issue where any nodes a slurm job was
>>> submitted to end up in "drain" state.
>>>
>>> From what I've seen, it appears that there is a problem with how
>>> slurm is cleaning up the job's processes with SIGKILL.
>>>
>>> I've found this slurm article
>>> (https://slurm.schedmd.com/troubleshoot.html#completing), which has
>>> a section titled "Jobs and nodes are stuck in COMPLETING state"
>>> where it recommends increasing "UnkillableStepTimeout" in
>>> slurm.conf, but all that has done is prolong the time it takes for
>>> the job to time out.
>>>
>>> The default time for the "UnkillableStepTimeout" is 60 seconds.
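>>> For reference, the change is just one line in slurm.conf (the value
>>> shown here is only an example, not what we actually set):
>>>
>>> UnkillableStepTimeout=120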
>>>
>>> After the job completes, it stays in the CG (completing) status for
>>> those 60 seconds, and then the nodes the job was submitted to go into
>>> drain status.
>>>
>>> On the headnode running slurmctld, I am seeing this in the log -
>>> /var/log/slurmctld:
>>>
>>> --------------------------------------------------------------------------------------------------------------------------------------------
>>>
>>> [2020-07-21T22:40:03.000] update_node: node node001 reason set to:
>>> Kill task failed
>>>
>>> [2020-07-21T22:40:03.001] update_node: node node001 state set to
>>> DRAINING
>>>
>>> On the compute node, I am seeing this in the log - /var/log/slurmd:
>>>
>>> --------------------------------------------------------------------------------------------------------------------------------------------
>>>
>>> [2020-07-21T22:38:33.110] [1485.batch] done with job
>>>
>>> [2020-07-21T22:38:33.110] [1485.extern] Sent signal 18 to
>>> 1485.4294967295
>>>
>>> [2020-07-21T22:38:33.111] [1485.extern] Sent signal 15 to
>>> 1485.4294967295
>>>
>>> [2020-07-21T22:39:02.820] [1485.extern] Sent SIGKILL signal to
>>> 1485.4294967295
>>>
>>> [2020-07-21T22:40:03.000] [1485.extern] error: *** EXTERN STEP FOR
>>> 1485 STEPD TERMINATED ON node001 AT 2020-07-21T22:40:02 DUE TO JOB
>>> NOT ENDING WITH SIGNALS ***
>>>
>>> I've tried restarting the slurmd daemon on the compute nodes, and
>>> even completely rebooting a few compute nodes (node001, node002).
>>>
>>> From what I've seen, we're experiencing this on all nodes in the cluster.
>>>
>>> I've yet to restart the headnode because there are still active jobs
>>> on the system, so I don't want to interrupt those.
>>>
>>> Thank you for your time,
>>>
>>> Ivan
>>>