[slurm-users] Nodes going into drain because of "Kill task failed"
mercan
ahmet.mercan at uhem.itu.edu.tr
Thu Jul 23 16:31:53 UTC 2020
Hi;
Are you sure this is a job task completion issue? When the Epilog script
fails, Slurm will set the node to the DRAIN state:
"If the Epilog fails (returns a non-zero exit code), this will result in
the node being set to a DRAIN state"
https://slurm.schedmd.com/prolog_epilog.html
You can test this possibility by adding an "exit 0" line at the end of
the epilog script.
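For example, a minimal sketch (the script path and its contents are
placeholders; only the trailing "exit 0" matters for this test):

  #!/bin/bash
  # /etc/slurm/epilog.sh -- existing cleanup commands go above
  # ...
  # Force a zero exit code so that a failing command above cannot,
  # by itself, put the node into the DRAIN state.
  exit 0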
Regards;
Ahmet M.
On 23.07.2020 18:34, Ivan Kovanda wrote:
>
> Thanks for the input guys!
>
> We don’t even use Lustre filesystems, and it doesn’t appear to be I/O.
>
> I execute *iostat* on both the head node and the compute node while the
> job is in CG status, and the %iowait value is 0.00 or 0.01.
>
> $ iostat
>
> Linux 3.10.0-957.el7.x86_64 (node002) 07/22/2020 _x86_64_
> (32 CPU)
>
> avg-cpu: %user %nice %system %iowait %steal %idle
>
> 0.01 0.00 0.01 0.00 0.00 99.98
>
> Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
>
> sda 0.82 14.09 2.39 1157160 196648
>
> I also tried the following command to identify any processes in D state
> on the compute node, but got no results:
>
> ps aux | awk '$8 ~ /D/ { print $0 }'
>
> This one’s got me stumped…
>
> Sorry, I’m not too familiar with epilog yet; do you have any examples
> of how I would use that to log the SIGKILL event?
>
> Thanks again,
> Ivan
>
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> *On Behalf
> Of *Paul Edmon
> *Sent:* Thursday, July 23, 2020 7:19 AM
> *To:* slurm-users at lists.schedmd.com
> *Subject:* Re: [slurm-users] Nodes going into drain because of "Kill
> task failed"
>
> Same here. Whenever we see rashes of "Kill task failed", it is
> invariably symptomatic of one of our Lustre filesystems acting up or
> being saturated.
>
> -Paul Edmon-
>
> On 7/22/2020 3:21 PM, Ryan Cox wrote:
>
> Angelos,
>
> I'm glad you mentioned UnkillableStepProgram. We meant to look at
> that a while ago but forgot about it. That will be very useful
> for us as well, though the answer for us is pretty much always
> Lustre problems.
>
> Ryan
>
> On 7/22/20 1:02 PM, Angelos Ching wrote:
>
> Agreed. You may also want to write a script that gathers the
> list of programs in "D" state (uninterruptible kernel wait) and
> prints their stacks, and configure it as UnkillableStepProgram,
> so that you can capture the program and the relevant system calls
> that caused the job to become unkillable / time out while exiting,
> for further troubleshooting.
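> A rough sketch of such a script (the path, log location, and use of
> /proc/<pid>/stack are assumptions, not a tested recipe; reading
> /proc/<pid>/stack generally requires root):
>
>   #!/bin/bash
>   # Hypothetical UnkillableStepProgram: dump D-state processes and
>   # their kernel stacks for later inspection.
>   LOG=/var/log/slurm/unkillable-$(hostname)-$(date +%s).log
>   {
>       date
>       ps -eo pid=,stat=,comm= | awk '$2 ~ /^D/' | \
>       while read -r pid stat comm; do
>           echo "--- PID $pid ($comm, state $stat) ---"
>           cat /proc/"$pid"/stack 2>/dev/null
>       done
>   } >> "$LOG"
>   exit 0
>
> It would then be referenced from slurm.conf with something like
> UnkillableStepProgram=/usr/local/sbin/dump_unkillable.sh.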
>
>
> Regards,
>
> Angelos
>
> (Sent from mobile, please pardon me for typos and cursoriness.)
>
>
>
> On 2020/07/23 0:41, Ryan Cox <ryan_cox at byu.edu> wrote:
>
> Ivan,
>
> Are you having I/O slowness? That is the most common cause
> for us. If it's not that, you'll want to look through all
> the reasons that it takes a long time for a process to
> actually die after a SIGKILL because one of those is the
> likely cause. Typically it's because the process is
> waiting for an I/O syscall to return. Sometimes swap death
> is the culprit, but usually not at the scale that you
> stated. Maybe you could try reproducing the issue
> manually or putting something in the epilog to see the state
> of the processes in the job's cgroup.
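> A rough sketch of that kind of epilog check (the cgroup path below
> assumes Slurm's cgroup v1 hierarchy under the freezer controller and
> the SLURM_JOB_UID / SLURM_JOB_ID variables that Slurm exports to the
> Epilog; adjust for your setup):
>
>   #!/bin/bash
>   # Log any processes still attached to the job's cgroup at epilog time.
>   CG=/sys/fs/cgroup/freezer/slurm/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}
>   if [ -d "$CG" ]; then
>       find "$CG" -name cgroup.procs -exec cat {} + 2>/dev/null | \
>       sort -u | while read -r pid; do
>           ps -o pid=,stat=,wchan=,cmd= -p "$pid"
>       done >> /var/log/slurm/epilog-leftovers-${SLURM_JOB_ID}.log
>   fi
>   exit 0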
>
> Ryan
>
> On 7/22/20 10:24 AM, Ivan Kovanda wrote:
>
> Dear slurm community,
>
> Currently running slurm version 18.08.4
>
> We have been experiencing an issue in which any node that a Slurm
> job was submitted to ends up in the "drain" state.
>
> From what I've seen, it appears that there is a problem with how
> Slurm is cleaning up the job when it sends SIGKILL.
>
> I've found the Slurm troubleshooting article
> (https://slurm.schedmd.com/troubleshoot.html#completing), which has a
> section titled "Jobs and nodes are stuck in COMPLETING state", where
> it recommends increasing "UnkillableStepTimeout" in slurm.conf, but
> all that has done is prolong the time it takes for the job to time
> out.
>
> The default time for the "UnkillableStepTimeout" is 60
> seconds.
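> For reference, the relevant slurm.conf entries look roughly like this
> (the values and the script path are only illustrative;
> UnkillableStepTimeout is in seconds):
>
>   UnkillableStepTimeout=180
>   UnkillableStepProgram=/usr/local/sbin/dump_unkillable.sh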
>
> After the job completes, it stays in the CG (completing) status for
> those 60 seconds, and then the nodes the job was submitted to go
> into drain status.
>
> On the headnode running slurmctld, I am seeing this in
> the log - /var/log/slurmctld:
>
> --------------------------------------------------------------------------------------------------------------------------------------------
>
> [2020-07-21T22:40:03.000] update_node: node node001
> reason set to: Kill task failed
>
> [2020-07-21T22:40:03.001] update_node: node node001
> state set to DRAINING
>
> On the compute node, I am seeing this in the log -
> /var/log/slurmd
>
> --------------------------------------------------------------------------------------------------------------------------------------------
>
> [2020-07-21T22:38:33.110] [1485.batch] done with job
>
> [2020-07-21T22:38:33.110] [1485.extern] Sent signal 18
> to 1485.4294967295
>
> [2020-07-21T22:38:33.111] [1485.extern] Sent signal 15
> to 1485.4294967295
>
> [2020-07-21T22:39:02.820] [1485.extern] Sent SIGKILL
> signal to 1485.4294967295
>
> [2020-07-21T22:40:03.000] [1485.extern] error: ***
> EXTERN STEP FOR 1485 STEPD TERMINATED ON node001 AT
> 2020-07-21T22:40:02 DUE TO JOB NOT ENDING WITH SIGNALS ***
>
> I've tried restarting the slurmd daemon on the compute nodes, and
> even completely rebooting a few compute nodes (node001, node002).
>
> From what I've seen, we're experiencing this on all
> nodes in the cluster.
>
> I've yet to restart the head node because there are
> still active jobs on the system, and I don't want to
> interrupt those.
>
> Thank you for your time,
>
> Ivan
>