[slurm-users] After reboot nodes are in state = down

Steffen Grunewald steffen.grunewald at aei.mpg.de
Fri Sep 27 13:36:32 UTC 2019


On Fri, 2019-09-27 at 14:58:40 +0200, Rafał Kędziorski wrote:
> Am Fr., 27. Sept. 2019 um 13:50 Uhr schrieb Steffen Grunewald <
> steffen.grunewald at aei.mpg.de>:
> > On Fri, 2019-09-27 at 11:19:16 +0200, Juergen Salk wrote:
> > >
> > > you may try setting `ReturnToService=2` in slurm.conf.
> > >
> > Caveat: A spontaneously rebooting machine may create a "black hole" this
> > way.
> >
> How do you mean this? Could ReturnToService=2 be a problem?

For us it was - we had (and still have) nodes spontaneously rebooting.
If they come up into idle, they will eat the next job, etc. ad infinitum -
thus we've set ReturnToService=0.

"Black hole" in a figurative sense: a node that keeps swallowing every job it can get its hands on.

You've got to decide what's worse: have full control over machines rebooted
intentionally, or have full control over misbehaving ones. My own choice is clear.
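
For illustration, a minimal sketch of the two settings discussed here. The
comments paraphrase the slurm.conf man page as I understand it, and <node>
is a placeholder, not a real host name:

```
# slurm.conf (fragment)

# 0: a node that was set DOWN stays DOWN until an administrator
#    explicitly changes its state, e.g. with
#      scontrol update NodeName=<node> State=RESUME
ReturnToService=0

# 2: a DOWN node becomes available again as soon as it registers
#    with a valid configuration, so a spontaneously rebooting node
#    goes straight back to accepting jobs
#ReturnToService=2
```

With the first setting, every reboot (intended or not) needs a manual
resume; with the second, nothing stands between a flaky node and the queue.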

- S


