[slurm-users] Apparent scontrol reboot ASAP bug

Martijn Kruiten martijn.kruiten at surfsara.nl
Mon Jan 21 16:55:49 UTC 2019


Hi,

We are seeing a strange issue on our system (Slurm 18.08.3), and I'm
curious whether any of you recognize this behavior. In the following
example we try to reboot 32 nodes, of which 31 are idle:

root# scontrol reboot ASAP nextstate=resume reason=image r8n[1-32]
root# sinfo -o "%100E %9u %19H %N"
REASON                                                                                               USER      TIMESTAMP           NODELIST
image                                                                                                root      2019-01-21T17:03:49 r8n32
image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:47 r8n[1-3]
image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:47 r8n[4-10]
image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:48 r8n[11-15]
image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:48 r8n[16-23]
image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:49 r8n[24-29]
image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:49 r8n[30-31]

(The REASON field is truncated at 100 characters by the %100E format,
which is why the last "reboot issued" is cut off.)

For as long as the allocated node (r8n32) has not been rebooted,
"reboot issued" keeps getting appended to the reason of all other
nodes, and the ResumeTimeout is ignored. Even worse: the other nodes
get stuck in an endless reboot loop, as if they keep receiving the
instruction to reboot. As soon as I cancel the reboot for the
allocated node, the reboot loop stops for all other nodes.
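For completeness, this is roughly how we cancel the pending reboot on
the allocated node (scontrol cancel_reboot exists in this Slurm
release; the node name here is just our example node):

root# scontrol cancel_reboot r8n32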


This also happens if we issue the reboot command in a loop, one node
at a time:

root# for n in {1..32}; do scontrol reboot ASAP nextstate=resume
reason=image r8n$n; done

So it seems that Slurm somehow groups all nodes that need to be
rebooted together, and issues reboot commands to them until the last
one of them is ready to reboot. This happens regardless of whether the
scontrol command has been issued for all nodes at once or
independently.

I should add that the command works fine if we need to reboot just one
node, or a couple of nodes that were already idle to begin with. The
RebootProgram is /sbin/reboot, so nothing out of the ordinary.
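Since idle nodes are unaffected, a possible workaround (untested
sketch, assuming sinfo's -t idle filter matches only the nodes that
are safe to reboot immediately) would be to issue the ASAP reboot only
to the idle nodes first, and deal with allocated nodes separately once
they drain:

root# for n in $(sinfo -h -t idle -n r8n[1-32] -o '%n'); do scontrol
reboot ASAP nextstate=resume reason=image "$n"; done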


Best regards,
Martijn Kruiten
-- 
| System Programmer | SURFsara | Science Park 140 | 1098 XG Amsterdam |
| T +31 6 20043417 | martijn.kruiten at surfsara.nl | https://surfsara.nl |

We are ISO 27001 certified and meet the high requirements for information security.
