[slurm-users] Nodes going into drain because of "Kill task failed"

mercan ahmet.mercan at uhem.itu.edu.tr
Thu Jul 23 16:31:53 UTC 2020


Hi;

Are you sure this is a job task completion issue? When the epilog script 
fails, Slurm will set the node to the DRAIN state:

"If the Epilog fails (returns a non-zero exit code), this will result in 
the node being set to a DRAIN state"

https://slurm.schedmd.com/prolog_epilog.html

You can test this possibility by adding an "exit 0" line at the end of 
the epilog script.
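
For example, a stripped-down test epilog could look like this (the log 
path is only a placeholder; SLURM_JOB_ID is provided to the epilog by 
slurmd):

#!/bin/bash
# Minimal test epilog: record that it ran, then always return success
# so that a failing epilog cannot be the cause of the DRAIN state.
echo "$(date) epilog done for job ${SLURM_JOB_ID}" >> /tmp/epilog_test.log
exit 0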

Regards;

Ahmet M.


On 23.07.2020 18:34, Ivan Kovanda wrote:
>
> Thanks for the input guys!
>
> We don’t even use Lustre filesystems, and it doesn’t appear to be I/O.
>
> I execute *iostat* on both the head node and the compute node while the 
> job is in CG status, and the %iowait value is 0.00 or 0.01:
>
> $ iostat
>
> Linux 3.10.0-957.el7.x86_64 (node002)   07/22/2020      _x86_64_   (32 CPU)
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0.01    0.00    0.01    0.00    0.00   99.98
>
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> sda               0.82        14.09         2.39    1157160     196648
>
> I also tried the following command to see if I could identify any 
> processes in D state on the compute node, but got no results:
>
> ps aux | awk '$8 ~ /D/  { print $0 }'
>
> This one’s got me stumped…
>
> Sorry, I’m not too familiar with the epilog yet; do you have any examples 
> of how I would use that to log the SIGKILL event?
>
> Thanks again,
> Ivan
>
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> *On Behalf 
> Of *Paul Edmon
> *Sent:* Thursday, July 23, 2020 7:19 AM
> *To:* slurm-users at lists.schedmd.com
> *Subject:* Re: [slurm-users] Nodes going into drain because of "Kill 
> task failed"
>
> Same here.  Whenever we see a rash of "Kill task failed" errors, it is 
> invariably symptomatic of one of our Lustre filesystems acting up or 
> being saturated.
>
> -Paul Edmon-
>
> On 7/22/2020 3:21 PM, Ryan Cox wrote:
>
>     Angelos,
>
>     I'm glad you mentioned UnkillableStepProgram.  We meant to look at
>     that a while ago but forgot about it.  That will be very useful
>     for us as well, though the answer for us is pretty much always
>     Lustre problems.
>
>     Ryan
>
>     On 7/22/20 1:02 PM, Angelos Ching wrote:
>
>         Agreed. You may also want to write a script that gathers the
>         list of programs in "D state" (kernel wait) and prints their
>         stacks, and configure it as UnkillableStepProgram so that you
>         can capture the programs and the relevant system calls that
>         caused the job to become unkillable / time out while exiting,
>         for further troubleshooting.
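>
>         For example, a rough sketch (the script and log paths are only
>         placeholders, and reading /proc/<pid>/stack needs root):
>
>         #!/bin/bash
>         # UnkillableStepProgram sketch: record every process stuck in
>         # D state together with its kernel stack for troubleshooting.
>         LOG=/var/log/slurm/unkillable-$(hostname)-$(date +%s).log
>         ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/' >> "$LOG"
>         for pid in $(ps -eo pid,stat | awk '$2 ~ /D/ {print $1}'); do
>             echo "=== kernel stack of PID $pid ===" >> "$LOG"
>             cat /proc/$pid/stack >> "$LOG" 2>/dev/null
>         done
>
>         It would then be enabled in slurm.conf with something like
>         UnkillableStepProgram=/usr/local/sbin/dump_unkillable.sh
>         (again, the path is only an example).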
>
>
>         Regards,
>
>         Angelos
>
>         (Sent from mobile, please pardon me for typos and cursoriness.)
>
>
>
>             On 2020/07/23 0:41, Ryan Cox <ryan_cox at byu.edu> wrote:
>
>             Ivan,
>
>             Are you having I/O slowness? That is the most common cause
>             for us. If it's not that, you'll want to look through all
>             the reasons that it takes a long time for a process to
>             actually die after a SIGKILL because one of those is the
>             likely cause. Typically it's because the process is
>             waiting for an I/O syscall to return. Sometimes swap death
>             is the culprit, but usually not at the scale that you
>             stated.  Maybe you could try reproducing the issue
>             manually or putting something in the epilog to see the
>             state of the processes in the job's cgroup.
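>
>             Something along these lines might work as a starting point
>             (a rough sketch; it assumes the cgroup v1 freezer hierarchy
>             that Slurm normally uses, and the paths are only guesses):
>
>             #!/bin/bash
>             # Epilog sketch: log the state of any processes still left
>             # in this job's cgroup before Slurm gives up on them.
>             CG=/sys/fs/cgroup/freezer/slurm/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}
>             if [ -d "$CG" ]; then
>                 for pid in $(cat "$CG"/cgroup.procs 2>/dev/null); do
>                     ps -o pid,stat,wchan:32,cmd -p "$pid" >> /tmp/epilog_cgroup_${SLURM_JOB_ID}.log
>                 done
>             fi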
>
>             Ryan
>
>             On 7/22/20 10:24 AM, Ivan Kovanda wrote:
>
>                 Dear slurm community,
>
>                 Currently running slurm version 18.08.4
>
>                 We have been experiencing an issue that causes the
>                 nodes a Slurm job ran on to end up in "drain" state.
>
>                 From what I've seen, it appears that there is a
>                 problem with how Slurm is cleaning up the job when it
>                 sends SIGKILL.
>
>                 I've found this Slurm article
>                 (https://slurm.schedmd.com/troubleshoot.html#completing),
>                 which has a section titled "Jobs and nodes are stuck
>                 in COMPLETING state", where it recommends increasing
>                 "UnkillableStepTimeout" in slurm.conf, but all that has
>                 done is prolong the time it takes for the job to time out.
>
>                 The default time for the "UnkillableStepTimeout" is 60
>                 seconds.
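>
>                 (For reference, the change itself is just one line in
>                 slurm.conf; the value below is only an example:)
>
>                 # seconds slurmd waits after SIGKILL before declaring
>                 # the step unkillable
>                 UnkillableStepTimeout=120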
>
>                 After the job completes, it stays in the CG
>                 (completing) status for the 60 seconds, then the nodes
>                 the job was submitted to go to drain status.
>
>                 On the headnode running slurmctld, I am seeing this in
>                 the log - /var/log/slurmctld:
>
>                 --------------------------------------------------------------------------------------------------------------------------------------------
>
>                 [2020-07-21T22:40:03.000] update_node: node node001
>                 reason set to: Kill task failed
>
>                 [2020-07-21T22:40:03.001] update_node: node node001
>                 state set to DRAINING
>
>                 On the compute node, I am seeing this in the log -
>                 /var/log/slurmd
>
>                 --------------------------------------------------------------------------------------------------------------------------------------------
>
>                 [2020-07-21T22:38:33.110] [1485.batch] done with job
>
>                 [2020-07-21T22:38:33.110] [1485.extern] Sent signal 18
>                 to 1485.4294967295
>
>                 [2020-07-21T22:38:33.111] [1485.extern] Sent signal 15
>                 to 1485.4294967295
>
>                 [2020-07-21T22:39:02.820] [1485.extern] Sent SIGKILL
>                 signal to 1485.4294967295
>
>                 [2020-07-21T22:40:03.000] [1485.extern] error: ***
>                 EXTERN STEP FOR 1485 STEPD TERMINATED ON node001 AT
>                 2020-07-21T22:40:02 DUE TO JOB NOT ENDING WITH SIGNALS ***
>
>                 I've tried restarting the slurmd daemon on the compute
>                 nodes, and even completely rebooting a few compute
>                 nodes (node001, node002).
>
>                 From what I've seen, we're experiencing this on all
>                 nodes in the cluster.
>
>                 I've yet to restart the head node because there are
>                 still active jobs on the system and I don't want to
>                 interrupt those.
>
>                 Thank you for your time,
>
>                 Ivan
>


