[slurm-users] Issue with x11
Alan Orth
alan.orth at gmail.com
Fri May 17 19:47:54 UTC 2019
Dear Christopher,
I took your suggestion and increased UnkillableStepTimeout from 60 to 120
seconds, but a few hours later three of my nodes were drained again with
reason "Kill task failed". We're not using cgroups. There is a bug¹ on
SchedMD's tracker where people are trying to understand this error, and the
discussion there suggests it may be related to the new X11 code in SLURM 18.08.
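
For anyone who wants to track how often this happens, here is a rough sketch of pulling out the nodes drained with this reason. It assumes `sinfo` is on the PATH and that `sinfo -h -R -o "%E|%N"` prints one "reason|nodelist" pair per line; the function names are my own and purely illustrative, not part of Slurm.

```python
import subprocess


def nodes_drained_for(reason_text, sinfo_output):
    """Return the node-list strings whose reason contains reason_text.

    sinfo_output is expected to hold one "reason|nodelist" pair per line
    (an assumption about the chosen sinfo output format).
    """
    matches = []
    for line in sinfo_output.splitlines():
        if "|" not in line:
            continue  # skip blank or unexpected lines
        # Split on the last "|" in case the reason text itself contains one.
        reason, nodelist = line.rsplit("|", 1)
        if reason_text.lower() in reason.lower():
            matches.append(nodelist.strip())
    return matches


def read_sinfo_reasons():
    """Run sinfo and return its raw output (requires a working Slurm client)."""
    # -h: no header, -R: list reasons, -o: print "reason|nodelist"
    result = subprocess.run(
        ["sinfo", "-h", "-R", "-o", "%E|%N"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    for nodelist in nodes_drained_for("Kill task failed", read_sinfo_reasons()):
        print(nodelist)
```

Once a node has been checked, something like `scontrol update NodeName=<nodelist> State=RESUME` should return it to service, though that only clears the drain and does nothing about the underlying cause.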
Regards,
¹ https://bugs.schedmd.com/show_bug.cgi?id=6307
On Thu, May 16, 2019 at 7:02 PM Christopher Samuel <chris at csamuel.org>
wrote:
> On 5/16/19 1:04 AM, Alan Orth wrote:
>
> > but now we get a handful of nodes drained every day with reason "Kill
> > task failed". In ten years of using SLURM I've never had so many
> > problems as I'm having now. :\
>
> We see "kill task failed" issues, but as Marcus says that's not related
> to X11 support. When we see it, it's usually because the kernel cannot
> evict dirty pages from cgroups quickly enough (or at all) for Slurm's
> liking. You may want to raise UnkillableStepTimeout from its default of
> 60 seconds.
>
> All the best,
> Chris
> --
> Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
>
>
--
Alan Orth
alan.orth at gmail.com
https://picturingjordan.com
https://englishbulgaria.net
https://mjanja.ch
"In heaven all the interesting people are missing." ―Friedrich Nietzsche