[slurm-users] draining nodes due to failed killing of task?
Willy Markuske
wmarkuske at sdsc.edu
Fri Aug 6 15:06:21 UTC 2021
Adrian and Diego,
Are you using AMD Epyc processors on the systems where you see this
issue? I've been having the same issue, but only on dual AMD Epyc
systems. I haven't tried moving the core file location off an NFS mount
yet, though, so perhaps the core isn't being written out in time.
How did you disable core files?
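For reference, a minimal sketch of one way to suppress job core files (not
something confirmed in this thread, and the details depend on how limits are
propagated at a given site): stop Slurm from propagating the submit host's
core-size limit and clamp it on the compute nodes instead.

    # slurm.conf: take RLIMIT_CORE from the node, not the submit host
    PropagateResourceLimitsExcept=CORE

    # compute nodes: systemd override for slurmd, e.g.
    # /etc/systemd/system/slurmd.service.d/override.conf
    # (assumes slurmd runs under systemd; steps it launches should then
    # inherit its core limit of 0)
    [Service]
    LimitCORE=0

    # apply: scontrol reconfigure for the slurm.conf change, plus
    # systemctl daemon-reload && systemctl restart slurmd on each node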
Regards,
Willy Markuske
HPC Systems Engineer
Research Data Services
P: (619) 519-4435
On 8/6/21 6:16 AM, Adrian Sevcenco wrote:
> On 8/6/21 3:19 PM, Diego Zuccato wrote:
>> IIRC we increased SlurmdTimeout to 7200.
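(A note beyond what's quoted above: SlurmctldTimeout/SlurmdTimeout govern
controller-to-slurmd communication, while the "Kill task failed" drain reason
is normally tied to UnkillableStepTimeout, the time slurmd waits for a
signaled step to actually exit. A rough slurm.conf sketch using the 7200
value from this thread and an illustrative UnkillableStepTimeout:)

    # values discussed in this thread
    SlurmctldTimeout=300
    SlurmdTimeout=7200

    # not mentioned above: how long slurmd waits for a killed step to
    # exit before draining the node with "Kill task failed"
    # (the default is 60 seconds; 180 is only an example value)
    UnkillableStepTimeout=180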
> Thanks a lot!
>
> Adrian
>
>>
>> On 06/08/2021 13:33, Adrian Sevcenco wrote:
>>> On 8/6/21 1:56 PM, Diego Zuccato wrote:
>>>> We had a similar problem some time ago (slow creation of big core
>>>> files) and solved it by increasing the Slurm timeouts
>>> Oh, I see. Well, in principle I should not have core files, and I
>>> do not find any...
>>>
>>>> to the point that even the slowest core dump wouldn't trigger it.
>>>> Then, once the need for core files was over, I disabled core files
>>>> and restored the timeouts.
>>> And how much did you increase them? I have
>>> SlurmctldTimeout=300
>>> SlurmdTimeout=300
>>>
>>> Thank you!
>>> Adrian
>>>
>>>
>>>>
>>>> On 06/08/2021 12:46, Adrian Sevcenco wrote:
>>>>> On 8/6/21 1:27 PM, Diego Zuccato wrote:
>>>>>> Hi.
>>>>> Hi!
>>>>>
>>>>>> Might it be due to a timeout (maybe the killed job is creating a
>>>>>> core file, or caused heavy swap usage)?
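(One way to check which of these it is, sketched here with an assumed log
path: look at the slurmd log on the drained node around the time of the
failed kill; the exact messages differ between Slurm versions.)

    # find where slurmd logs on that node (the path varies by site)
    scontrol show config | grep -i SlurmdLogFile

    # then search around the drain time; /var/log/slurm/slurmd.log is
    # only an assumed example path
    grep -iE "unkillable|kill task failed|timed out" /var/log/slurm/slurmd.log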
>>>>> I will have to search for the culprit...
>>>>> The problem is: why would the node be put in drain because the
>>>>> killing failed? And how can I control/disable this?
>>>>>
>>>>> Thank you!
>>>>> Adrian
>>>>>
>>>>>
>>>>>>
>>>>>> BYtE,
>>>>>> Diego
>>>>>>
>>>>>> On 06/08/2021 09:02, Adrian Sevcenco wrote:
>>>>>>> Having just implemented some triggers, I noticed this:
>>>>>>>
>>>>>>> NODELIST   NODES PARTITION STATE    CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
>>>>>>> alien-0-47 1     alien*    draining 48   48:1:1 193324 214030   1      rack-0,4 Kill task failed
>>>>>>> alien-0-56 1     alien*    drained  48   48:1:1 193324 214030   1      rack-0,4 Kill task failed
>>>>>>>
>>>>>>> I was wondering why a node is drained when killing a task fails,
>>>>>>> and how can I disable that? (I use cgroups.)
>>>>>>> Moreover, how can the killing of a task fail? (This is on Slurm
>>>>>>> 19.05.)
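(For completeness: once whatever was stuck on the node has actually exited,
the drain itself can be cleared by hand; the node names below are the ones
from the sinfo output above.)

    scontrol update NodeName=alien-0-47,alien-0-56 State=RESUME

As far as I know there is no switch to stop Slurm from draining on a failed
kill; raising UnkillableStepTimeout (mentioned further up) is the usual
mitigation.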
>>>>>>>
>>>>>>> Thank you!
>>>>>>> Adrian
>>>>>>>
>
>