[slurm-users] draining nodes due to failed killing of task?

Chris Samuel chris at csamuel.org
Sun Aug 8 00:19:04 UTC 2021


On Friday, 6 August 2021 12:02:45 AM PDT Adrian Sevcenco wrote:

> I was wondering why a node is drained when killing of a task fails, and
> how can I disable it? (I use cgroups.) Moreover, how can the killing of a
> task fail? (This is on Slurm 19.05.)

Slurm has tried to kill the processes, but they refuse to go away. Usually this 
means they're stuck in device or I/O wait for some reason, so look for 
processes that are in a "D" (uninterruptible sleep) state on the node.
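One generic way to spot them (a sketch, not specific to Slurm) is:

```shell
# List processes in uninterruptible sleep ("D" state), keeping the header.
# The wchan column shows the kernel function each process is waiting in.
ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'
```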

As others have said, they can be stuck writing out large files, waiting for 
the kernel to complete that before they exit.  This can also happen if you're 
using GPUs and something has gone wrong in the driver, leaving the process 
stuck in the kernel somewhere.

You can try running "echo w > /proc/sysrq-trigger" (as root) on the node; 
that dumps the stacks of blocked tasks into the kernel log, so check "dmesg" 
afterwards to see which tasks are stuck and where they are stuck.

If there are tasks stuck in that state then often the only recourse is to 
reboot the node back into health.

You can tell Slurm to run a program on the node should it find itself in this 
state, see:

https://slurm.schedmd.com/slurm.conf.html#OPT_UnkillableStepProgram
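A minimal sketch of such a program, purely for illustration (the script name,
log location, and any SLURM_* environment variables are assumptions; only the
UnkillableStepProgram and UnkillableStepTimeout slurm.conf settings themselves
are from the docs):

```shell
#!/bin/bash
# Hypothetical UnkillableStepProgram handler: capture which processes are
# stuck in "D" state before the node gets rebooted, so the evidence survives.
# Writing to a mktemp file here for illustration; in production you'd pick
# a persistent path (and perhaps page an operator).
LOG=$(mktemp /tmp/unkillable.XXXXXX)
{
  echo "unkillable step reported on $(hostname) at $(date -u)"
  # Dump whatever SLURM_* variables slurmd exports (e.g. the job ID) --
  # check your version's docs before relying on any particular one.
  env | grep '^SLURM_' || true
  ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'
} > "$LOG"
```

Point slurm.conf at it with something like
UnkillableStepProgram=/usr/local/sbin/unkillable.sh, and consider raising
UnkillableStepTimeout if your filesystem just needs longer to flush.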

Best of luck,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
