[slurm-users] draining nodes due to failed killing of task?
b.h.mevik at usit.uio.no
Mon Aug 9 06:15:39 UTC 2021
Adrian Sevcenco <Adrian.Sevcenco at spacescience.ro> writes:
> Having just implemented some triggers i just noticed this:
> NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
> alien-0-47 1 alien* draining 48 48:1:1 193324 214030 1 rack-0,4 Kill task failed
> alien-0-56 1 alien* drained 48 48:1:1 193324 214030 1 rack-0,4 Kill task failed
> i was wondering why a node is drained when killing of task fails
I guess the heuristic is that something is wrong with the node, so it
should not run more jobs. For instance, processes stuck in disk wait
(D state) or similar conditions that might require manual intervention.
> and how can i disable it? (i use cgroups)
I don't know how to disable it, but the timeout can be tuned with the
UnkillableStepTimeout setting in slurm.conf. From the slurm.conf man page:
    The length of time, in seconds, that Slurm will wait before
    deciding that processes in a job step are unkillable (after
    they have been signaled with SIGKILL) and execute
    UnkillableStepProgram. The default timeout value is 60 seconds.
    If exceeded, the compute node will be drained to prevent
    future jobs from being scheduled on the node.
(Note though, that according to
https://bugs.schedmd.com/show_bug.cgi?id=11103 it should not be set
higher than 127 s.)
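As a sketch, the timeout could be raised in slurm.conf; the value of 120
seconds here is just an example, chosen to stay below the 127 s limit
mentioned above:

```
# slurm.conf -- example value, not a site recommendation
# Wait up to 120 s after SIGKILL before declaring the step unkillable
# (kept below the 127 s limit noted in bug 11103).
UnkillableStepTimeout=120
```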
You might also want to look at the UnkillableStepProgram setting to find
out what is going on on the machine when Slurm cannot kill the job step:
    If the processes in a job step are determined to be unkillable
    for a period of time specified by the UnkillableStepTimeout
    variable, the program specified by UnkillableStepProgram
    will be executed. By default no program is run.
    See section UNKILLABLE STEP PROGRAM SCRIPT for more information.
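A minimal sketch of such a program, assuming it is installed at a path of
your choosing and pointed to by UnkillableStepProgram (the script name and
log location below are hypothetical, not Slurm defaults):

```shell
#!/bin/bash
# Hypothetical UnkillableStepProgram, e.g. in slurm.conf:
#   UnkillableStepProgram=/usr/local/sbin/unkillable-report.sh
#
# When Slurm runs this after UnkillableStepTimeout expires, snapshot the
# process table so stuck D-state (disk-wait) processes can be identified
# after the node has drained.

LOG=/tmp/unkillable-step.log

{
    echo "=== unkillable step detected: $(date) on $(hostname) ==="
    # State D (uninterruptible sleep) usually points at a hung filesystem
    # or device; wchan shows what the kernel is waiting on.
    ps -eo pid,stat,wchan,comm
} > "$LOG"

echo "wrote $LOG"
```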
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo