[slurm-users] draining nodes due to failed killing of task?

Fri Aug 6 10:46:53 UTC 2021

On 8/6/21 1:27 PM, Diego Zuccato wrote:
> Hi.
Hi!

> Might it be due to a timeout (maybe the killed job is creating a core file, or caused heavy swap usage)?
I will have to search for the culprit.
The real question is: why is the node put into drain because killing a task failed, and how can I control or
disable this?
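
While looking for the culprit, it may help to list the affected nodes automatically. Below is a minimal sketch, not something from this thread: it parses text in the same layout as the sinfo listing quoted further down and returns the nodes whose REASON is "Kill task failed". The function name `nodes_with_reason` is hypothetical, and the sketch assumes every column before REASON is a single whitespace-free token.

```python
# Sample text copied from the sinfo listing quoted in this thread.
SINFO_OUTPUT = """\
NODELIST    NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
alien-0-47      1    alien*    draining   48   48:1:1 193324 214030      1 rack-0,4 Kill task failed
alien-0-56      1    alien*     drained   48   48:1:1 193324 214030      1 rack-0,4 Kill task failed
"""


def nodes_with_reason(text, reason="Kill task failed"):
    """Return (node, state) pairs whose REASON column equals `reason`.

    Hypothetical helper: assumes the first line is the header and that
    only the last column (REASON) may contain spaces.
    """
    lines = text.strip().splitlines()
    header = lines[0].split()
    n_fixed = len(header) - 1  # number of columns before REASON
    matches = []
    for line in lines[1:]:
        # Split at most n_fixed times so the REASON text stays in one piece.
        fields = line.split(None, n_fixed)
        if len(fields) <= n_fixed:
            continue  # malformed or empty row, skip it
        row = dict(zip(header, fields))
        if row["REASON"].strip() == reason:
            matches.append((row["NODELIST"], row["STATE"]))
    return matches


if __name__ == "__main__":
    for node, state in nodes_with_reason(SINFO_OUTPUT):
        print(node, state)
```

On a live cluster the sample string would be replaced by real sinfo output with the same columns; how that output is produced is left out here on purpose.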

Thank you!
Adrian

> 
> BYtE,
>   Diego
> 
> On 06/08/2021 09:02, Adrian Sevcenco wrote:
>> Having just implemented some triggers, I noticed this:
>>
>> NODELIST    NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
>> alien-0-47      1    alien*    draining   48   48:1:1 193324 214030      1 rack-0,4 Kill task failed
>> alien-0-56      1    alien*     drained   48   48:1:1 193324 214030      1 rack-0,4 Kill task failed
>>
>> I was wondering why a node is drained when killing a task fails, and how I can disable that (I use cgroups).
>> Moreover, how can killing a task fail in the first place? (This is on Slurm 19.05.)
>>
>> Thank you!
>> Adrian
>>
>>
>