[slurm-users] Nodes going into drain because of "Kill task failed"

Ryan Cox ryan_cox at byu.edu
Wed Jul 22 16:39:03 UTC 2020


Ivan,

Are you experiencing I/O slowness? That is the most common cause for us.
If it's not that, you'll want to work through the reasons it can take a
long time for a process to actually die after a SIGKILL, because one of
those is the likely cause. Typically the process is stuck waiting for an
I/O syscall to return. Sometimes swap death is the culprit, but usually
not at the scale you describe. You could try reproducing the issue
manually, or put something in the epilog to see the state of the
processes left in the job's cgroup.
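
If it helps, here is a minimal sketch of the kind of epilog check I mean.
It only logs whatever is still in the job's cgroup; the log path is
arbitrary and the cgroup path is an assumption for a cgroup v1 setup with
the default layout, so adjust both for your site (SLURM_JOB_ID and
SLURM_JOB_UID are normally available in the epilog environment):

#!/bin/bash
# Hypothetical epilog fragment: record the state of any processes still
# sitting in the job's cgroup so we can see what refuses to die.
LOG=/var/log/slurm/epilog-unkillable.log   # arbitrary log location
CGDIR=/sys/fs/cgroup/freezer/slurm/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}

if [ -f "${CGDIR}/cgroup.procs" ]; then
    echo "== $(date) job ${SLURM_JOB_ID} leftover processes ==" >> "$LOG"
    for pid in $(cat "${CGDIR}/cgroup.procs"); do
        # STAT "D" = uninterruptible sleep, usually a stuck I/O syscall;
        # WCHAN shows the kernel function the process is waiting in.
        ps -o pid,stat,wchan:32,cmd -p "$pid" >> "$LOG" 2>/dev/null
    done
fi
exit 0

Processes stuck in D state there would point at the I/O theory; an
already-empty cgroup would suggest the problem is somewhere else.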

Ryan

On 7/22/20 10:24 AM, Ivan Kovanda wrote:
>
> Dear slurm community,
>
> Currently running slurm version 18.08.4
>
> We have been experiencing an issue where the nodes a slurm job was 
> submitted to end up in "drain" state.
>
> From what I've seen, it appears there is a problem with how slurm 
> cleans up the job after sending SIGKILL.
>
> I've found this slurm article 
> (https://slurm.schedmd.com/troubleshoot.html#completing), which has a 
> section titled "Jobs and nodes are stuck in COMPLETING state" that 
> recommends increasing "UnkillableStepTimeout" in slurm.conf, but all 
> that has done is prolong the time it takes for the job to time out.
>
> The default time for the "UnkillableStepTimeout" is 60 seconds.
>
> After the job completes, it stays in CG (completing) status for those 
> 60 seconds, and then the nodes the job was submitted to go into drain 
> status.
>
> On the headnode running slurmctld, I am seeing this in the log - 
> /var/log/slurmctld:
>
> --------------------------------------------------------------------------------------------------------------------------------------------
>
> [2020-07-21T22:40:03.000] update_node: node node001 reason set to: 
> Kill task failed
>
> [2020-07-21T22:40:03.001] update_node: node node001 state set to DRAINING
>
> On the compute node, I am seeing this in the log - /var/log/slurmd:
>
> --------------------------------------------------------------------------------------------------------------------------------------------
>
> [2020-07-21T22:38:33.110] [1485.batch] done with job
>
> [2020-07-21T22:38:33.110] [1485.extern] Sent signal 18 to 1485.4294967295
>
> [2020-07-21T22:38:33.111] [1485.extern] Sent signal 15 to 1485.4294967295
>
> [2020-07-21T22:39:02.820] [1485.extern] Sent SIGKILL signal to 
> 1485.4294967295
>
> [2020-07-21T22:40:03.000] [1485.extern] error: *** EXTERN STEP FOR 
> 1485 STEPD TERMINATED ON node001 AT 2020-07-21T22:40:02 DUE TO JOB NOT 
> ENDING WITH SIGNALS ***
>
> I've tried restarting the slurmd daemon on the compute nodes, and even 
> completely rebooting a few compute nodes (node001, node002).
>
> From what I've seen, we're experiencing this on all nodes in the cluster.
>
> I've yet to restart the headnode because there are still active jobs 
> on the system, and I don't want to interrupt those.
>
> Thank you for your time,
>
> Ivan
>
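
One note on the UnkillableStepTimeout setting discussed above: the change
normally goes into slurm.conf along the lines of the sketch below (the
180-second value and the script path are only illustrative), but as Ivan
has already seen, raising it only delays the drain unless whatever is
keeping the processes unkillable is fixed.

# slurm.conf (illustrative values)
# Give slurmstepd longer to reap unkillable processes before the node is
# drained with "Kill task failed".
UnkillableStepTimeout=180
# Optionally run a script when an unkillable step is detected, for example
# to dump process state (hypothetical path):
#UnkillableStepProgram=/usr/local/sbin/unkillable_step_report.sh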
