[slurm-users] Nodes do not return to service after scontrol reboot

David Baker D.J.Baker at soton.ac.uk
Tue Jun 16 15:16:43 UTC 2020


Hello,

We are running Slurm v19.05.5 and I am experimenting with the scontrol reboot command. I find that compute nodes reboot, but they are not returned to service. Instead, they remain down following the reboot:

navy55         1    debug*        down   80   2:20:2 192000        0   2000   (null) Reboot ASAP : reboot

This is a diskful node, so it doesn't take long to reboot. For the sake of argument I have set ResumeTimeout to 1000 seconds, which is well over what's needed:

[root@navy51 slurm]# grep -i resume slurm.conf
ResumeTimeout=1000
[root@navy51 slurm]# grep -i return slurm.conf
ReturnToService=0
[root@navy51 slurm]# grep -i nhc slurm.conf
# LBNL Node Health Check (NHC)
#HealthCheckProgram=/usr/sbin/nhc

For this experiment I have disabled the health checker, and I don't think setting ReturnToService=1 helps. Could anyone please help with this? We are about to update the node firmware, so it would be useful to ensure that the nodes are returned to service after they reboot.

Best regards,
David