[slurm-users] draining nodes due to failed killing of task?
Adrian Sevcenco
Adrian.Sevcenco at spacescience.ro
Sun Aug 8 06:47:06 UTC 2021
Hi!
On 8/8/21 3:19 AM, Chris Samuel wrote:
> On Friday, 6 August 2021 12:02:45 AM PDT Adrian Sevcenco wrote:
>
>> I was wondering why a node is drained when the killing of a task fails, and how
>> I can disable that? (I use cgroups.) Moreover, how can the killing of a task fail?
>> (This is on Slurm 19.05.)
>
> Slurm has tried to kill processes, but they refuse to go away. Usually this
> means they're stuck in a device or I/O wait for some reason, so look for
> processes that are in a "D" state on the node.
Yes, the running jobs do some file saving when they are killed, and depending on
the target that save can get stuck ...
I have to think of a way to take a snapshot of the processes when this happens ..
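Just as a first, untested sketch of such a snapshot (file names and paths here are
arbitrary), something like:

    # dump processes stuck in uninterruptible sleep ("D" state),
    # plus their kernel stacks, into a timestamped file
    OUT=/tmp/dstate-$(hostname -s)-$(date +%s).log
    ps -eo state,pid,user,wchan:32,cmd | awk 'NR==1 || $1 == "D"' > "$OUT"
    for p in $(ps -eo state,pid --no-headers | awk '$1 == "D" {print $2}'); do
        echo "=== kernel stack of PID $p ===" >> "$OUT"
        cat /proc/$p/stack >> "$OUT" 2>/dev/null   # needs root
    done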
> As others have said they can be stuck writing out large files and waiting for
> the kernel to complete that before they exit. This can also happen if you're
> using GPUs and something has gone wrong in the driver and the process is stuck
> in the kernel somewhere.
>
> You can try doing "echo w > /proc/sysrq-trigger" on the node to see if the
> kernel reports tasks stuck and where they are stuck.
>
> If there are tasks stuck in that state then often the only recourse is to
> reboot the node back into health.
Yeah, that would be bad, as is the move to draining .. I use batch jobs
and I can have 128 different jobs on a single node .. I will see if I can increase
some timeouts.
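If I read the slurm.conf man page right, the relevant knob seems to be
UnkillableStepTimeout (the value below is only an example, not a recommendation):

    # slurm.conf: seconds slurmd waits for the processes of a job step to end
    # after being signalled, before declaring the step unkillable
    UnkillableStepTimeout=180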
> You can tell Slurm to run a program on the node should it find itself in this
> state, see:
>
> https://slurm.schedmd.com/slurm.conf.html#OPT_UnkillableStepProgram
Oh, thanks for the hint, I glossed over this when looking through the documentation.
As a first approximation I can make this a tool for reporting what is going on,
and later add some actions..
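Something along these lines, maybe - a pure reporting sketch, with an arbitrary
script path and log location (and the SLURM_* variables in it are my assumption of
what slurmd exports, to be verified against the docs):

    slurm.conf:
        UnkillableStepProgram=/usr/local/sbin/unkillable_report.sh

    /usr/local/sbin/unkillable_report.sh:
        #!/bin/bash
        # run by slurmd on the node when a step is declared unkillable;
        # record the D-state processes so we can see where they are stuck
        LOG=/var/log/slurm/unkillable-$(date +%s).log
        {
            echo "unkillable step on $(hostname) at $(date)"
            # SLURM_JOB_ID / SLURM_STEP_ID: assumed to be set, needs checking
            echo "job=${SLURM_JOB_ID:-?} step=${SLURM_STEP_ID:-?}"
            ps -eo state,pid,user,wchan:32,cmd | awk 'NR==1 || $1 == "D"'
        } > "$LOG" 2>&1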
Thanks a lot!
Adrian