<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<p>Adrian and Diego,</p>
<p>Are you using AMD Epyc processors when you see this issue? I've
been having the same issue, but only on dual AMD Epyc systems. I
haven't tried moving the core file location off an NFS mount,
though, so perhaps there's a problem writing the core file out in time.</p>
<p>How did you disable core files?</p>
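<p>For reference, my reading of the timeout change described in the
quoted thread is roughly the slurm.conf fragment below. This is only a
sketch: the SlurmdTimeout value is the one Diego recalls ("IIRC"),
SlurmctldTimeout is Adrian's current value left unchanged, and the
commented-out UnkillableStepTimeout line is my own assumption (nobody
in the thread mentions it, and the value is illustrative).</p>
<pre>
# slurm.conf (sketch, not a tested configuration)
SlurmctldTimeout=300        # Adrian's current value, unchanged
SlurmdTimeout=7200          # raised from 300, per Diego's recollection
# Assumption, not from the thread: how long slurmd waits after SIGKILL
# before treating a step as unkillable (illustrative value).
#UnkillableStepTimeout=120
</pre>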
<p>Regards,<br>
</p>
<div class="moz-signature">
<table cellspacing="0" cellpadding="0" border="0">
<tbody>
<tr>
<td width="150" valign="top" height="30" align="left">
<p style="font-size:14px;">Willy Markuske</p>
</td>
</tr>
<tr>
<td style="border-right: 1px solid #000;" align="left">
<p style="font-size:12px;">HPC Systems Engineer</p>
</td>
<td rowspan="3" width="180" valign="center" height="42" align="center"><tt><img moz-do-not-send="false" src="cid:part1.740BAA87.EF586C76@sdsc.edu" alt="" width="168" height="48"></tt> </td>
</tr>
<tr>
<td style="border-right: 1px solid #000;" align="left">
<p style="font-size:12px;">Research Data Services</p>
</td>
</tr>
<tr>
<td style="border-right: 1px solid #000;" align="left">
<p style="font-size:12px;">P: (619) 519-4435</p>
</td>
</tr>
</tbody>
</table>
<p> </p>
</div>
<div class="moz-cite-prefix">On 8/6/21 6:16 AM, Adrian Sevcenco
wrote:<br>
</div>
<blockquote type="cite" cite="mid:b6dd2426-b4d4-19e2-4a45-7aeabc407451@spacescience.ro">On
8/6/21 3:19 PM, Diego Zuccato wrote:
<br>
<blockquote type="cite">IIRC we increased SlurmdTimeout to 7200.
<br>
</blockquote>
Thanks a lot!
<br>
<br>
Adrian
<br>
<br>
<blockquote type="cite">
<br>
On 06/08/2021 13:33, Adrian Sevcenco wrote:
<br>
<blockquote type="cite">On 8/6/21 1:56 PM, Diego Zuccato wrote:
<br>
<blockquote type="cite">We had a similar problem some time ago
(slow creation of big core files) and solved it by
increasing the Slurm timeouts
<br>
</blockquote>
Oh, I see. Well, in principle I should not have core files,
and I do not find any...
<br>
<br>
<blockquote type="cite">to the point that even the slowest
core wouldn't trigger it. Then, once the need for core files
was over, I disabled core files and restored the timeouts.
<br>
</blockquote>
And how much did you increase them? I have
<br>
SlurmctldTimeout=300
<br>
SlurmdTimeout=300
<br>
<br>
Thank you!
<br>
Adrian
<br>
<br>
<br>
<blockquote type="cite">
<br>
On 06/08/2021 12:46, Adrian Sevcenco wrote:
<br>
<blockquote type="cite">On 8/6/21 1:27 PM, Diego Zuccato
wrote:
<br>
<blockquote type="cite">Hi.
<br>
</blockquote>
Hi!
<br>
<br>
<blockquote type="cite">Might it be due to a timeout
(maybe the killed job is creating a core file, or caused
heavy swap usage)?
<br>
</blockquote>
I will have to search for the culprit.
<br>
The question is why a node would be drained because killing a task
failed, and how I can control or disable this.
<br>
<br>
Thank you!
<br>
Adrian
<br>
<br>
<br>
<blockquote type="cite">
<br>
BYtE,
<br>
Diego
<br>
<br>
On 06/08/2021 09:02, Adrian Sevcenco wrote:
<br>
<blockquote type="cite">Having just implemented some
triggers, I noticed this:
<br>
<br>
<pre>
NODELIST    NODES  PARTITION  STATE     CPUS  S:C:T   MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
alien-0-47  1      alien*     draining  48    48:1:1  193324  214030    1       rack-0,4  Kill task failed
alien-0-56  1      alien*     drained   48    48:1:1  193324  214030    1       rack-0,4  Kill task failed
</pre>
<br>
<br>
I was wondering why a node is drained when killing a
task fails, and how I can disable this. (I use cgroups.)
<br>
Moreover, how can killing a task fail? (This is
on Slurm 19.05.)
<br>
<br>
Thank you!
<br>
Adrian
<br>
<br>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
<br>
<br>
</blockquote>
</body>
</html>