[slurm-users] draining nodes due to failed killing of task?

Adrian Sevcenco Adrian.Sevcenco at spacescience.ro
Sat Aug 7 19:21:39 UTC 2021


On 8/6/21 6:06 PM, Willy Markuske wrote:
> Adrian and Diego,
Hi!

> Are you using AMD Epyc processors when viewing this issue? I've been having the same issue but only on dual AMD Epyc 
i do have some Epyc nodes, but the CPU mix is about 50/50 with Broadwell cores,
and I do not see any correlation/preference of the problem for the Epyc ones.

> systems. I haven't tried changing the core file location from an NFS mount though so perhaps there's an issue writing it 
> out in time.
> 
> How did you disable core files?
To tell the truth, I do not know at this moment :)) I have to search the conf files,
but I see this:
[root@alien ~]# ulimit -a | grep core
core file size          (blocks, -c) 0

You can either add a file under /etc/security/limits.d/
containing:
* hard core 0

and/or set the soft limit in the current shell:
ulimit -S -c 0
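
A slightly fuller sketch of the limits.d route (the file name below is my own
invention; any *.conf file under limits.d is read, and these limits apply to PAM
sessions, while the ulimit call above only affects the current shell):

# /etc/security/limits.d/99-disable-core.conf (hypothetical name)
# a hard limit of 0 blocks disables core dumps for all users
*    hard    core    0

# verify from a fresh login shell:
ulimit -c    # should print 0

Also, if I remember right, Slurm propagates the submission shell's limits to the
job by default (PropagateResourceLimits in slurm.conf), so the limit on the submit
host matters as well.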

HTH,
Adrian


> 
> Regards,
> 
> Willy Markuske
> 
> HPC Systems Engineer
> 
> 
> Research Data Services
> 
> P: (619) 519-4435
> 
> On 8/6/21 6:16 AM, Adrian Sevcenco wrote:
>> On 8/6/21 3:19 PM, Diego Zuccato wrote:
>>> IIRC we increased SlurmdTimeout to 7200.
>> Thanks a lot!
>>
>> Adrian
>>
>>>
>>> On 06/08/2021 13:33, Adrian Sevcenco wrote:
>>>> On 8/6/21 1:56 PM, Diego Zuccato wrote:
>>>>> We had a similar problem some time ago (slow creation of big core files) and solved it by increasing the Slurm 
>>>>> timeouts
>>>> Oh, I see... well, in principle I should not have core files, and I do not find any...
>>>>
>>>>> to the point that even the slowest core wouldn't trigger it. Then, once the need for core files was over, I 
>>>>> disabled core files and restored the timeouts.
>>>> And how much did you increase them? I have
>>>> SlurmctldTimeout=300
>>>> SlurmdTimeout=300
>>>>
>>>> Thank you!
>>>> Adrian
>>>>
>>>>
>>>>>
>>>>> On 06/08/2021 12:46, Adrian Sevcenco wrote:
>>>>>> On 8/6/21 1:27 PM, Diego Zuccato wrote:
>>>>>>> Hi.
>>>>>> Hi!
>>>>>>
>>>>>>> Might it be due to a timeout (maybe the killed job is creating a core file, or caused heavy swap usage)?
>>>>>> I will have to search for the culprit...
>>>>>> The problem is: why would the node be put in drain because a kill failed? And how can I
>>>>>> control/disable this?
>>>>>>
>>>>>> Thank you!
>>>>>> Adrian
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> BYtE,
>>>>>>>   Diego
>>>>>>>
>>>>>>> On 06/08/2021 09:02, Adrian Sevcenco wrote:
>>>>>>>> Having just implemented some triggers, I noticed this:
>>>>>>>>
>>>>>>>> NODELIST    NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
>>>>>>>> alien-0-47      1    alien*    draining   48   48:1:1 193324 214030      1 rack-0,4 Kill task failed
>>>>>>>> alien-0-56      1    alien*     drained   48   48:1:1 193324 214030      1 rack-0,4 Kill task failed
>>>>>>>>
>>>>>>>> I was wondering why a node is drained when the killing of a task fails, and how can I disable that? (I use cgroups.)
>>>>>>>> Moreover, how can the killing of a task fail at all? (This is on Slurm 19.05.)
>>>>>>>>
>>>>>>>> Thank you!
>>>>>>>> Adrian
>>>>>>>>
>>
>>
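
PS: the timeouts Diego mentions live in slurm.conf. A minimal sketch of what I
would try, with example values only (and, if I read the slurm.conf man page right,
UnkillableStepTimeout is the parameter tied to the "Kill task failed" drain
reason: slurmd drains the node when a step is still alive that long after SIGKILL):

# slurm.conf (excerpt) -- example values, not recommendations
SlurmctldTimeout=300        # secs the backup slurmctld waits before taking over
SlurmdTimeout=7200          # secs slurmctld waits for slurmd before marking the node DOWN
UnkillableStepTimeout=180   # secs slurmd waits for a step to die after SIGKILL
                            # before reporting "Kill task failed" and draining

After editing, something like this should apply the change and return the drained
nodes to service:

scontrol reconfigure
scontrol update nodename=alien-0-[47,56] state=resume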


-- 
----------------------------------------------
Adrian Sevcenco, Ph.D.                       |
Institute of Space Science - ISS, Romania    |
adrian.sevcenco at {cern.ch,spacescience.ro} |
----------------------------------------------



