[slurm-users] draining nodes due to failed killing of task?

Adrian Sevcenco Adrian.Sevcenco at spacescience.ro
Fri Aug 6 13:16:48 UTC 2021


On 8/6/21 3:19 PM, Diego Zuccato wrote:
> IIRC we increased SlurmdTimeout to 7200.
Thanks a lot!
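For the record, a minimal sketch of how I plan to apply it in slurm.conf (7200 is the value you mention; the right number obviously depends on how long the slowest core dump or cleanup can take on a node):

  # slurm.conf (controller and compute nodes)
  SlurmctldTimeout=300   # seconds a backup slurmctld waits for the primary before taking over
  SlurmdTimeout=7200     # seconds slurmctld waits for slurmd to respond before marking the node DOWN

  # push the change and double-check the running values
  # (some parameters need a daemon restart rather than a reconfigure):
  scontrol reconfigure
  scontrol show config | grep -i timeout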

Adrian

> 
> On 06/08/2021 13:33, Adrian Sevcenco wrote:
>> On 8/6/21 1:56 PM, Diego Zuccato wrote:
>>> We had a similar problem some time ago (slow creation of big core files) and solved it by increasing the Slurm timeouts
>> Oh, I see. Well, in principle I should not have any core files, and I cannot find any...
>>
>>> to the point that even the slowest core wouldn't trigger it. Then, once the need for core files was over, I disabled 
>>> core files and restored the timeouts.
>> And by how much did you increase them? I have
>> SlurmctldTimeout=300
>> SlurmdTimeout=300
>>
>> Thank you!
>> Adrian
>>
>>
>>>
>>> On 06/08/2021 12:46, Adrian Sevcenco wrote:
>>>> On 8/6/21 1:27 PM, Diego Zuccato wrote:
>>>>> Hi.
>>>> Hi!
>>>>
>>>>> Might it be due to a timeout (maybe the killed job is creating a core file, or caused heavy swap usage)?
>>>> I will have to search for the culprit...
>>>> The problem is: why would the node be put into the drain state just because killing a task failed? And how can
>>>> I control/disable this?
>>>>
>>>> Thank you!
>>>> Adrian
>>>>
>>>>
>>>>>
>>>>> BYtE,
>>>>>   Diego
>>>>>
>>>>> On 06/08/2021 09:02, Adrian Sevcenco wrote:
>>>>>> Having just implemented some triggers, I noticed this:
>>>>>>
>>>>>> NODELIST    NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
>>>>>> alien-0-47      1    alien*    draining   48   48:1:1 193324 214030      1 rack-0,4 Kill task failed
>>>>>> alien-0-56      1    alien*     drained   48   48:1:1 193324 214030      1 rack-0,4 Kill task failed
>>>>>>
>>>>>> I was wondering why a node is drained when killing a task fails, and how can I disable this? (I use cgroups.)
>>>>>> Moreover, how can killing a task fail in the first place? (This is on Slurm 19.05.)
>>>>>>
>>>>>> Thank you!
>>>>>> Adrian
>>>>>>
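Coming back to my original question for completeness: as far as I can tell the drain is deliberate (slurmd gives up on a job's processes after UnkillableStepTimeout and drains the node so nothing else is scheduled onto it), so the plan here is to clean up whatever is stuck and then resume the nodes by hand. A rough sketch, using the node names from the sinfo output above:

  # show why the nodes were drained
  sinfo -R
  scontrol show node alien-0-47 | grep -i Reason

  # once the stuck processes are gone, put the nodes back in service
  scontrol update NodeName=alien-0-47,alien-0-56 State=RESUME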



