[slurm-users] Apparent scontrol reboot bug

Martijn Kruiten martijn.kruiten at surfsara.nl
Tue Jan 22 10:20:09 UTC 2019


Hi,

We are encountering a strange issue on our system (Slurm 18.08.3), and I'm curious whether any of you recognize this behavior. In the following example we try to reboot 32 nodes, 31 of which are idle:

root# scontrol reboot ASAP nextstate=resume reason=image r8n[1-32]
root# sinfo -o "%100E %9u %19H %N"
REASON                                                                                               USER      TIMESTAMP           NODELIST
image                                                                                                root      2019-01-21T17:03:49 r8n32
image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:47 r8n[1-3]
image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:47 r8n[4-10]
image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:48 r8n[11-15]
image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:48 r8n[16-23]
image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:49 r8n[24-29]
image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:49 r8n[30-31]

As long as the allocated node (r8n32) has not been rebooted, the "reboot issued" message keeps being appended to the reason of all the other nodes, and the ResumeTimeout is ignored. Even worse: the other nodes get stuck in an endless reboot loop; they appear to keep receiving the instruction to reboot. As soon as I cancel the reboot for the allocated node, the reboot loop stops for all the other nodes.
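(For completeness, cancelling the pending reboot for the allocated node, which is what stops the loop on the other nodes, is done with something like:

root# scontrol cancel_reboot r8n32
)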

This also happens if we issue the reboot command in a loop:

root# for n in {1..32}; do scontrol reboot ASAP nextstate=resume reason=image r8n$n; done

So it seems that Slurm somehow groups together all nodes that need to be rebooted, and keeps issuing reboot commands to them until the last one of them is ready to reboot. This happens regardless of whether the scontrol command is issued for all nodes at once or for each node independently.
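The per-node state and reason can be inspected with, for example:

root# scontrol show node r8n1 | grep -E 'State|Reason'
root# sinfo -N -n r8n[1-32] -o "%N %T %E"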

I should add that the command works fine if we reboot just one node, or a couple of nodes that were idle to begin with. The RebootProgram is /sbin/reboot, so nothing out of the ordinary.
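For reference, the relevant slurm.conf settings look roughly like this (the ResumeTimeout value here is just an example):

RebootProgram=/sbin/reboot
ResumeTimeout=300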

Best regards,
Martijn Kruiten
-- 
| System Programmer | SURFsara | Science Park 140 | 1098 XG Amsterdam |
| T +31 6 20043417  | martijn.kruiten at surfsara.nl | www.surfsara.nl |