[slurm-users] draining nodes due to failed killing of task?
Adrian Sevcenco
Adrian.Sevcenco at spacescience.ro
Sun Aug 8 06:47:06 UTC 2021
Hi!
On 8/8/21 3:19 AM, Chris Samuel wrote:
> On Friday, 6 August 2021 12:02:45 AM PDT Adrian Sevcenco wrote:
>
>> I was wondering why a node is drained when the killing of a task fails, and how
>> I can disable that? (I use cgroups.) Moreover, how can the killing of a task fail?
>> (This is on Slurm 19.05.)
>
> Slurm has tried to kill processes, but they refuse to go away. Usually this
> means they're stuck in a device or I/O wait for some reason, so look for
> processes that are in a "D" state on the node.
Yes, the running jobs do some file saving when they are killed, and depending on
the target that save can get stuck ...
I have to think of a way to take a snapshot of the processes when this happens ..
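Just as a first, untested sketch of such a snapshot (file names and paths here are
arbitrary), something like:

    # dump processes stuck in uninterruptible sleep ("D" state),
    # plus their kernel stacks, into a timestamped file
    OUT=/tmp/dstate-$(hostname -s)-$(date +%s).log
    ps -eo state,pid,user,wchan:32,cmd | awk 'NR==1 || $1 == "D"' > "$OUT"
    for p in $(ps -eo state,pid --no-headers | awk '$1 == "D" {print $2}'); do
        echo "=== kernel stack of PID $p ===" >> "$OUT"
        cat /proc/$p/stack >> "$OUT" 2>/dev/null   # needs root
    done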
> As others have said they can be stuck writing out large files and waiting for
> the kernel to complete that before they exit. This can also happen if you're
> using GPUs and something has gone wrong in the driver and the process is stuck
> in the kernel somewhere.
>
> You can try doing "echo w > /proc/sysrq-trigger" on the node to see if the
> kernel reports tasks stuck and where they are stuck.
>
> If there are tasks stuck in that state then often the only recourse is to
> reboot the node back into health.
Yeah, that would be bad, as is the move to draining .. I use batch jobs
and I can have 128 different jobs on a single node .. I will see if I can increase
some timeouts.
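If I read the slurm.conf man page right, the relevant knob seems to be
UnkillableStepTimeout (the value below is only an example, not a recommendation):

    # slurm.conf: seconds slurmd waits for the processes of a job step to end
    # after being signalled, before declaring the step unkillable
    UnkillableStepTimeout=180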
> You can tell Slurm to run a program on the node should it find itself in this
> state, see:
>
> https://slurm.schedmd.com/slurm.conf.html#OPT_UnkillableStepProgram
Oh, thanks for the hint, I glossed over this when looking through the documentation.
As a first approximation I can make this a tool for reporting what is going on,
and later add some actions..
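Something along these lines, maybe - a pure reporting sketch, with an arbitrary
script path and log location (and the SLURM_* variables in it are my assumption of
what slurmd exports, to be verified against the docs):

    slurm.conf:
        UnkillableStepProgram=/usr/local/sbin/unkillable_report.sh

    /usr/local/sbin/unkillable_report.sh:
        #!/bin/bash
        # run by slurmd on the node when a step is declared unkillable;
        # record the D-state processes so we can see where they are stuck
        LOG=/var/log/slurm/unkillable-$(date +%s).log
        {
            echo "unkillable step on $(hostname) at $(date)"
            # SLURM_JOB_ID / SLURM_STEP_ID: assumed to be set, needs checking
            echo "job=${SLURM_JOB_ID:-?} step=${SLURM_STEP_ID:-?}"
            ps -eo state,pid,user,wchan:32,cmd | awk 'NR==1 || $1 == "D"'
        } > "$LOG" 2>&1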
Thanks a lot!
Adrian