[slurm-users] Nodes going into drain because of "Kill task failed"
Ryan Cox
ryan_cox at byu.edu
Wed Jul 22 19:21:13 UTC 2020
Angelos,
I'm glad you mentioned UnkillableStepProgram. We meant to look at that
a while ago but forgot about it. That will be very useful for us as
well, though the answer for us is pretty much always Lustre problems.
Ryan
On 7/22/20 1:02 PM, Angelos Ching wrote:
> Agreed. You may also want to write a script that gathers the list of
> programs in "D state" (kernel wait) and prints their stacks, and
> configure it as UnkillableStepProgram so that you can capture the
> programs and the relevant system calls that caused the job to become
> unkillable / time out while exiting, for further troubleshooting.
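> Just as a sketch (untested; the approach and any paths are only
> examples), such a script can walk /proc, pick out D-state processes,
> and dump their kernel stacks, e.g. in Python:
>
> #!/usr/bin/env python3
> # Sketch of an UnkillableStepProgram helper: list processes stuck in
> # D state (uninterruptible kernel wait) and print their kernel stacks.
> # Assumes it runs as root so /proc/<pid>/stack is readable.
> import os
>
> def read(path):
>     try:
>         with open(path) as f:
>             return f.read()
>     except OSError:
>         return ""
>
> for pid in filter(str.isdigit, os.listdir("/proc")):
>     stat = read(f"/proc/{pid}/stat")
>     if not stat:
>         continue
>     # The state letter follows the ')' that closes the comm field.
>     state = stat.rpartition(")")[2].split()[0]
>     if state == "D":
>         comm = read(f"/proc/{pid}/comm").strip()
>         print(f"PID {pid} ({comm}) in D state, kernel stack:")
>         print(read(f"/proc/{pid}/stack"))
>
> Then set UnkillableStepProgram in slurm.conf to wherever you install
> it.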
>
> Regards,
> Angelos
> (Sent from mobile, please pardon me for typos and cursoriness.)
>
>> On 2020/07/23 0:41, Ryan Cox <ryan_cox at byu.edu> wrote:
>>
>> Ivan,
>>
>> Are you having I/O slowness? That is the most common cause for us. If
>> it's not that, you'll want to look through all the reasons that it
>> takes a long time for a process to actually die after a SIGKILL
>> because one of those is the likely cause. Typically it's because the
>> process is waiting for an I/O syscall to return. Sometimes swap death
>> is the culprit, but usually not at the scale that you stated. Maybe
>> you could try reproducing the issue manually or putting something in
>> the epilog to see the state of the processes in the job's cgroup.
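>> For example (only a sketch, assuming cgroup v1 with Slurm's usual
>> freezer hierarchy and that the epilog environment provides
>> SLURM_JOB_ID and SLURM_JOB_UID), an epilog helper could walk the
>> job's cgroup and report the state of anything still in it:
>>
>> #!/usr/bin/env python3
>> # Sketch of an epilog helper: print the state of every process that
>> # is still in the job's cgroup. The cgroup path is an assumption
>> # based on a cgroup v1 freezer hierarchy; adjust it to your setup.
>> import os
>>
>> job = os.environ.get("SLURM_JOB_ID", "")
>> uid = os.environ.get("SLURM_JOB_UID", "")
>> cg = f"/sys/fs/cgroup/freezer/slurm/uid_{uid}/job_{job}"
>>
>> for root, dirs, files in os.walk(cg):
>>     if "cgroup.procs" not in files:
>>         continue
>>     with open(os.path.join(root, "cgroup.procs")) as f:
>>         pids = f.read().split()
>>     for pid in pids:
>>         try:
>>             with open(f"/proc/{pid}/stat") as f:
>>                 stat = f.read()
>>         except OSError:
>>             continue  # process exited in the meantime
>>         # The state letter follows the ')' closing the comm field.
>>         state = stat.rpartition(")")[2].split()[0]
>>         print(f"job {job}: PID {pid} in {root} has state {state}")
>>
>> Logging that output somewhere persistent makes it easier to line up
>> with the "Kill task failed" events afterwards.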
>>
>> Ryan
>>
>> On 7/22/20 10:24 AM, Ivan Kovanda wrote:
>>>
>>> Dear slurm community,
>>>
>>> Currently running slurm version 18.08.4
>>>
>>> We have been experiencing an issue where any nodes a slurm job was
>>> submitted to end up in "drain" state.
>>>
>>> From what I've seen, it appears that there is a problem with how
>>> slurm is cleaning up the job's processes with SIGKILL.
>>>
>>> I've found this slurm article
>>> (https://slurm.schedmd.com/troubleshoot.html#completing), which has
>>> a section titled "Jobs and nodes are stuck in COMPLETING state"
>>> where it recommends increasing "UnkillableStepTimeout" in
>>> slurm.conf, but all that has done is prolong the time it takes for
>>> the job to time out.
>>>
>>> The default time for the "UnkillableStepTimeout" is 60 seconds.
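>>> For reference, the change is just one line in slurm.conf (the value
>>> shown here is only an example, not what we actually set):
>>>
>>> UnkillableStepTimeout=120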
>>>
>>> After the job completes, it stays in the CG (completing) status for
>>> those 60 seconds, and then the nodes the job was submitted to go into
>>> drain status.
>>>
>>> On the headnode running slurmctld, I am seeing this in the log -
>>> /var/log/slurmctld:
>>>
>>> --------------------------------------------------------------------------------------------------------------------------------------------
>>>
>>> [2020-07-21T22:40:03.000] update_node: node node001 reason set to:
>>> Kill task failed
>>>
>>> [2020-07-21T22:40:03.001] update_node: node node001 state set to
>>> DRAINING
>>>
>>> On the compute node, I am seeing this in the log - /var/log/slurmd:
>>>
>>> --------------------------------------------------------------------------------------------------------------------------------------------
>>>
>>> [2020-07-21T22:38:33.110] [1485.batch] done with job
>>>
>>> [2020-07-21T22:38:33.110] [1485.extern] Sent signal 18 to
>>> 1485.4294967295
>>>
>>> [2020-07-21T22:38:33.111] [1485.extern] Sent signal 15 to
>>> 1485.4294967295
>>>
>>> [2020-07-21T22:39:02.820] [1485.extern] Sent SIGKILL signal to
>>> 1485.4294967295
>>>
>>> [2020-07-21T22:40:03.000] [1485.extern] error: *** EXTERN STEP FOR
>>> 1485 STEPD TERMINATED ON node001 AT 2020-07-21T22:40:02 DUE TO JOB
>>> NOT ENDING WITH SIGNALS ***
>>>
>>> I've tried restarting the slurmd daemon on the compute nodes, and
>>> even completely rebooting a few compute nodes (node001, node002).
>>>
>>> From what I've seen, we're experiencing this on all nodes in the cluster.
>>>
>>> I've yet to restart the headnode because there are still active jobs
>>> on the system, so I don't want to interrupt those.
>>>
>>> Thank you for your time,
>>>
>>> Ivan
>>>