[slurm-users] After reboot nodes are in state = down

Rafał Kędziorski rafal.kedziorski at gmail.com
Fri Sep 27 16:37:06 UTC 2019


OK, thanks for the explanation.

On Fri, 27 Sept 2019 at 15:38, Steffen Grunewald <
steffen.grunewald at aei.mpg.de> wrote:

> On Fri, 2019-09-27 at 14:58:40 +0200, Rafał Kędziorski wrote:
> > On Fri, 27 Sept 2019 at 13:50, Steffen Grunewald <
> > steffen.grunewald at aei.mpg.de> wrote:
> > > On Fri, 2019-09-27 at 11:19:16 +0200, Juergen Salk wrote:
> > > >
> > > > you may try setting ReturnToService=2 in slurm.conf.
> > > >
> > > Caveat: A spontaneously rebooting machine may create a "black hole"
> > > this way.
> > >
> > What do you mean by that? Could ReturnToService=2 be a problem?
>
> For us it was - we had (and still have) nodes spontaneously rebooting.
> If they come up into idle, they will eat the next job, and so on ad
> infinitum - thus we've set ReturnToService=0.
>
> "Black hole" in a figurative way, still swallowing all it could get its
> hands on.
>
> You've got to decide what's worse: the extra work of manually resuming
> machines that were rebooted intentionally, or losing control over
> misbehaving ones. My own choice is clear.
>
> - S
>
>
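For reference, a minimal sketch of the two approaches discussed above; the
node name node01 below is only a placeholder:

    # slurm.conf: decides whether a DOWN node returns to service once its
    # slurmd registers again with a valid configuration
    #   0 = node stays DOWN until an administrator resumes it
    #   1 = resume only nodes set DOWN for being non-responsive (the default)
    #   2 = resume any DOWN node, regardless of why it was set DOWN
    #       (this is the "black hole" risk described above)
    ReturnToService=0

    # With ReturnToService=0, return a checked node to service manually,
    # e.g. (node01 is a placeholder):
    scontrol update NodeName=node01 State=RESUME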

