[slurm-users] Issue with x11

Alan Orth alan.orth at gmail.com
Fri May 17 19:47:54 UTC 2019

Dear Christopher,

I did as you suggested and increased UnkillableStepTimeout from 60 to 120
seconds, but a few hours later three of my nodes were drained again with
reason "Kill task failed". We're not using cgroups. There is a bug report¹
on SchedMD's tracker that tries to get to the bottom of this error, and the
discussion there suggests it may be related to the new X11 code in SLURM 18.08.


¹ https://bugs.schedmd.com/show_bug.cgi?id=6307
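
For anyone tracking how often this happens, here is a minimal Python sketch
that groups drained nodes by reason. It assumes the output of
`sinfo -R -h -o "%E|%N"` (reason, then node list, separated by "|"); the
function names are made up for illustration and are not part of Slurm.

```python
import subprocess
from collections import defaultdict


def parse_drain_reasons(sinfo_output):
    """Group node lists by reason from `sinfo -R -h -o "%E|%N"` output."""
    reasons = defaultdict(list)
    for line in sinfo_output.splitlines():
        line = line.strip()
        if not line or "|" not in line:
            continue  # skip blank or unexpected lines
        # Split on the last "|" so a reason containing "|" stays intact.
        reason, nodelist = line.rsplit("|", 1)
        reasons[reason.strip()].append(nodelist.strip())
    return dict(reasons)


def current_drain_reasons():
    """Run sinfo on a host with the Slurm client tools and parse its output."""
    out = subprocess.run(
        ["sinfo", "-R", "-h", "-o", "%E|%N"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_drain_reasons(out)
```

Calling current_drain_reasons().get("Kill task failed", []) on a login node
should list the node ranges drained with that reason. Run from cron, it would
show whether the change to UnkillableStepTimeout makes any difference.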

On Thu, May 16, 2019 at 7:02 PM Christopher Samuel <chris at csamuel.org>
wrote:

> On 5/16/19 1:04 AM, Alan Orth wrote:
> > but now we get a handful of nodes drained every day with reason "Kill
> > task failed". In ten years of using SLURM I've never had so many
> > problems as I'm having now. :\
>
> We see "kill task failed" issues but as Marcus says that's not related
> to X11 support, when we see it it's usually because the kernel cannot
> evict dirty pages from cgroups quickly enough (or at all) for Slurm's
> liking.  You may want to tweak the default timeout for your
> UnkillableStepTimeout from the default of 60 seconds.
>
> All the best,
> Chris
> --
>    Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA

Alan Orth
alan.orth at gmail.com
"In heaven all the interesting people are missing." ―Friedrich Nietzsche