[slurm-users] draining nodes due to failed killing of task?

Diego Zuccato diego.zuccato at unibo.it
Fri Aug 6 12:19:03 UTC 2021


IIRC we increased SlurmdTimeout to 7200 (seconds).
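
For reference, a minimal sketch of what such a change could look like in
slurm.conf. Only the SlurmdTimeout value is taken from this thread. The
commented-out UnkillableStepTimeout line is an assumption about which
setting is behind the "Kill task failed" reason (its default is 60
seconds), and the value shown for it is purely illustrative, so check the
slurm.conf man page for your Slurm version before relying on it.

```
# slurm.conf (fragment); all timeouts are in seconds

# Unchanged, as in Adrian's current configuration
SlurmctldTimeout=300

# Raised from 300 while the slow core dumps were still needed
SlurmdTimeout=7200

# Assumption, not from this thread: time slurmd waits after SIGKILL
# before declaring a step unkillable. The value 180 is a placeholder.
#UnkillableStepTimeout=180
```

After editing, the change has to be propagated to the daemons (typically
with "scontrol reconfigure", or by restarting them). A node that is
already drained can be returned to service with
"scontrol update NodeName=alien-0-47 State=RESUME" once nothing is left
running on it.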

On 06/08/2021 13:33, Adrian Sevcenco wrote:
> On 8/6/21 1:56 PM, Diego Zuccato wrote:
>> We had a similar problem some time ago (slow creation of big core 
>> files) and solved it by increasing the Slurm timeouts
> Oh, I see. Well, in principle I should not have core files, and I do
> not find any.
> 
>> to the point that even the slowest core wouldn't trigger it. Then, 
>> once the need for core files was over, I disabled core files and 
>> restored the timeouts.
> And by how much did you increase them? I have
> SlurmctldTimeout=300
> SlurmdTimeout=300
> 
> Thank you!
> Adrian
> 
> 
>>
>> On 06/08/2021 12:46, Adrian Sevcenco wrote:
>>> On 8/6/21 1:27 PM, Diego Zuccato wrote:
>>>> Hi.
>>> Hi!
>>>
>>>> Might it be due to a timeout (maybe the killed job is creating a 
>>>> core file, or caused heavy swap usage)?
>>> I will have to search for the culprit.
>>> The question is: why would the node be put in drain because of a
>>> failed kill? And how can I control or disable
>>> this?
>>>
>>> Thank you!
>>> Adrian
>>>
>>>
>>>>
>>>> BYtE,
>>>>   Diego
>>>>
>>>> On 06/08/2021 09:02, Adrian Sevcenco wrote:
>>>>> Having just implemented some triggers, I noticed this:
>>>>>
>>>>> NODELIST    NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
>>>>> alien-0-47      1    alien*    draining   48   48:1:1 193324   214030      1 rack-0,4 Kill task failed
>>>>> alien-0-56      1    alien*     drained   48   48:1:1 193324   214030      1 rack-0,4 Kill task failed
>>>>>
>>>>> I was wondering why a node is drained when killing a task fails,
>>>>> and how can I disable it? (I use cgroups.)
>>>>> Moreover, how can killing a task fail? (This is on Slurm 19.05.)
>>>>>
>>>>> Thank you!
>>>>> Adrian
>>>>>
> 

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786


