[slurm-users] draining nodes due to failed killing of task?

Diego Zuccato diego.zuccato at unibo.it
Fri Aug 6 10:56:40 UTC 2021


We had a similar problem some time ago (slow creation of big core files) 
and solved it by increasing the Slurm timeouts to the point that even 
the slowest core dump wouldn't exceed them. Then, once the need for core 
files was over, I disabled core files and restored the timeouts.
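
FWIW, the knob involved should be UnkillableStepTimeout in slurm.conf: 
if a task is still alive that many seconds after slurmd tried to kill 
it, the node gets drained with reason "Kill task failed". A minimal 
sketch (the 300 is just an illustration; pick a value longer than your 
slowest core dump):

    # slurm.conf
    # Seconds slurmd waits for a task to die after trying to kill it,
    # before declaring the kill failed and draining the node (default 60).
    UnkillableStepTimeout=300

Propagate the change with "scontrol reconfigure" or by restarting the 
daemons.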

On 06/08/2021 12:46, Adrian Sevcenco wrote:
> On 8/6/21 1:27 PM, Diego Zuccato wrote:
>> Hi.
> Hi!
> 
>> Might it be due to a timeout (maybe the killed job is creating a core 
>> file, or causing heavy swap usage)?
> I will have to search for the culprit...
> The problem is: why would the node be put in a drain state because 
> killing a task failed, and how can I control/disable
> this?
> 
> Thank you!
> Adrian
> 
> 
>>
>> BYtE,
>>   Diego
>>
>> On 06/08/2021 09:02, Adrian Sevcenco wrote:
>>> Having just implemented some triggers, I noticed this:
>>>
>>> NODELIST   NODES PARTITION    STATE CPUS  S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
>>> alien-0-47     1    alien* draining   48 48:1:1 193324   214030      1 rack-0,4 Kill task failed
>>> alien-0-56     1    alien*  drained   48 48:1:1 193324   214030      1 rack-0,4 Kill task failed
>>>
>>> I was wondering why a node is drained when killing a task fails, and 
>>> how I can disable this? (I use cgroups.)
>>> Moreover, how can killing a task fail at all? (This is on Slurm 19.05.)
>>>
>>> Thank you!
>>> Adrian
>>>
>>>
>>
> 
> 

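PS: as far as I know, the drain-on-failed-kill behaviour itself cannot 
be turned off; once the underlying cause is fixed, a drained node has 
to be resumed by hand. For example, with one of the node names from the 
sinfo output above:

    scontrol update NodeName=alien-0-47 State=RESUME
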
-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786


