[slurm-users] draining nodes due to failed killing of task?
Chris Samuel
chris at csamuel.org
Sun Aug 8 00:19:04 UTC 2021
On Friday, 6 August 2021 12:02:45 AM PDT Adrian Sevcenco wrote:
> I was wondering why a node is drained when killing a task fails, and how
> can I disable that? (I use cgroups.) Moreover, how can killing a task fail?
> (This is on Slurm 19.05.)
Slurm has tried to kill processes, but they refuse to go away. Usually this
means they're stuck in a device or I/O wait for some reason, so look for
processes that are in a "D" state on the node.
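One way to spot them (a sketch; the exact columns available depend on your `ps` implementation, this assumes procps on Linux):

```shell
# List processes in uninterruptible sleep ("D" state), keeping the header.
# The wchan column shows which kernel function each task is blocked in.
ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'
```

Anything that stays in "D" state across repeated runs is a likely culprit.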
As others have said, they can be stuck writing out large files, waiting for
the kernel to finish that I/O before they can exit. It can also happen if
you're using GPUs and something has gone wrong in the driver, leaving the
process stuck inside the kernel somewhere.
You can try doing "echo w > /proc/sysrq-trigger" on the node to see if the
kernel reports tasks stuck and where they are stuck.
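The report goes to the kernel log rather than your terminal, so (assuming the sysrq interface is enabled via the kernel.sysrq sysctl, and you're running as root) the full check might look like:

```shell
# "w" asks the kernel to dump blocked (D-state) tasks and their kernel
# stack traces into the kernel ring buffer; requires root and sysrq enabled.
echo w > /proc/sysrq-trigger
# Read the resulting report from the kernel log.
dmesg | tail -50
```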
If there are tasks stuck in that state then often the only recourse is to
reboot the node back into health.
You can tell Slurm to run a program on the node should it find itself in this
state, see:
https://slurm.schedmd.com/slurm.conf.html#OPT_UnkillableStepProgram
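For example, in slurm.conf it might look like this (the script path and timeout value below are illustrative placeholders, not defaults):

```
# Hypothetical example: run a site-provided script on the node when a job
# step cannot be killed (e.g. to log D-state processes or request a reboot).
UnkillableStepProgram=/usr/local/sbin/report_unkillable.sh
# Seconds to wait after SIGKILL before treating the step as unkillable.
UnkillableStepTimeout=120
```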
Best of luck,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA