[slurm-users] Issue with x11

Alan Orth alan.orth at gmail.com
Fri May 17 19:47:54 UTC 2019


Dear Christopher,

I tried as you suggested and increased UnkillableStepTimeout from 60 to 120
seconds, but a few hours later three of my nodes were drained with reason
"Kill task failed" again. We're not using cgroups. There is a bug¹ on
SchedMD's tracker describing attempts to understand this error. The
discussion there suggests it may be related to the new X11 code in SLURM 18.08.
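
For reference, the timeout change described above corresponds to a
slurm.conf fragment along these lines. This is only a sketch: the value of
120 is the one mentioned above, and how the change is propagated to the
nodes (for example with "scontrol reconfigure" or by restarting slurmd) is
an assumption about the site setup, not something stated in this thread.

```
# slurm.conf (fragment)
# Seconds Slurm waits after sending SIGKILL to a step's processes before
# treating them as unkillable. The default is 60 seconds.
UnkillableStepTimeout=120
```

The value in effect can be checked with "scontrol show config", and the
reason recorded for drained nodes is listed by "sinfo -R".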

Regards,

¹ https://bugs.schedmd.com/show_bug.cgi?id=6307


On Thu, May 16, 2019 at 7:02 PM Christopher Samuel <chris at csamuel.org>
wrote:

> On 5/16/19 1:04 AM, Alan Orth wrote:
>
> > but now we get a handful of nodes drained every day with reason "Kill
> > task failed". In ten years of using SLURM I've never had so many
> > problems as I'm having now. :\
>
> We see "kill task failed" issues but, as Marcus says, that's not related
> to X11 support. When we see it, it's usually because the kernel cannot
> evict dirty pages from cgroups quickly enough (or at all) for Slurm's
> liking.  You may want to adjust UnkillableStepTimeout from its default
> of 60 seconds.
>
> All the best,
> Chris
> --
>    Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
>
>

-- 
Alan Orth
alan.orth at gmail.com
https://picturingjordan.com
https://englishbulgaria.net
https://mjanja.ch
"In heaven all the interesting people are missing." ―Friedrich Nietzsche