[slurm-users] After reboot nodes are in state = down

Juergen Salk juergen.salk at uni-ulm.de
Fri Sep 27 13:32:52 UTC 2019


* Rafał Kędziorski <rafal.kedziorski at gmail.com> [190927 14:58]:
> > >
> > > you may try setting `ReturnToService=2` in slurm.conf.
> > >
> >
> > Caveat: A spontaneously rebooting machine may create a "black hole" this
> > way.
> >
> 
> How do you mean this? Could ReturnToService=2 be a problem?
> 

Hi Rafał,

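Black hole syndrome occurs when a node keeps accepting new jobs
and then causes each of them to fail. Because a failed job frees
the node almost immediately, the node is always first in line for
the next job, and this may even flush all jobs from the queue for
no obvious reason.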

As Steffen said, this scenario may also happen if a node accepts a
job, then spontaneously reboots, then accepts the next job, then
reboots again, ...
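
To make the effect concrete, here is a small toy simulation in Python.
It is only a sketch and has nothing to do with how Slurm schedules
internally: the function name `simulate`, its parameters, and the
assumption that a broken node fails its job within a single tick are
all made up for illustration.

```python
from collections import deque


def simulate(num_jobs, good_nodes, bad_nodes, job_ticks=10):
    """Toy model of a job queue served by healthy and broken nodes.

    Assumptions (hypothetical, not Slurm behaviour):
    - a healthy node is busy for `job_ticks` ticks per job,
    - a broken node fails its job at once and is idle again next tick.
    Returns (jobs started on healthy nodes, jobs lost on broken nodes).
    """
    if good_nodes + bad_nodes == 0:
        raise ValueError("need at least one node")

    queue = deque(range(num_jobs))
    busy_until = [0] * good_nodes  # tick at which each healthy node is free
    started_ok = failed = 0
    tick = 0
    while queue:
        # A broken node is idle on every tick, so it takes a job every tick.
        for _ in range(bad_nodes):
            if queue:
                queue.popleft()
                failed += 1
        # A healthy node only takes a job once its previous one is done.
        for i in range(good_nodes):
            if queue and busy_until[i] <= tick:
                queue.popleft()
                busy_until[i] = tick + job_ticks
                started_ok += 1
        tick += 1
    return started_ok, failed


if __name__ == "__main__":
    print(simulate(100, good_nodes=9, bad_nodes=1))
    print(simulate(100, good_nodes=9, bad_nodes=0))
```

In this toy model, a single broken node next to nine healthy ones
swallows 51 of 100 queued jobs, simply because it is free again on
every tick while the healthy nodes are busy doing real work.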

> > Max Planck Institute for Gravitational Physics (Albert Einstein Institute)

That adds a somewhat funny touch in this context. ;-)

Best regards
Jürgen

-- 
Jürgen Salk
Scientific Software & Compute Services (SSCS)
Kommunikations- und Informationszentrum (kiz)
Universität Ulm
Telefon: +49 (0)731 50-22478
Telefax: +49 (0)731 50-22471


