<html dir="ltr"><head></head><body style="text-align:left; direction:ltr;"><div><span><div>Hi,</div><div><br></div><div>We are encountering a strange issue on our system (Slurm 18.08.3), and I'm curious whether any of you recognizes this behavior. In the following example we try to reboot 32 nodes, of which 31 nodes are idle:</div><div><br></div><div><font face="monospace" size="3">root# scontrol reboot ASAP nextstate=resume reason=image r8n[1-32]</font></div><div><font face="monospace" size="3">root# sinfo -o "%100E %9u %19H %N"</font></div><div><font face="monospace" size="3">REASON USER TIMESTAMP NODELIST</font></div><div><font face="monospace" size="3">image root 2019-01-21T17:03:49 r8n32</font></div><div><font face="monospace" size="3">image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root 2019-01-21T17:03:47 r8n[1-3]</font></div><div><font face="monospace" size="3">image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root 2019-01-21T17:03:47 r8n[4-10]</font></div><div><font face="monospace" size="3">image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root 2019-01-21T17:03:48 r8n[11-15]</font></div><div><font face="monospace" size="3">image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root 2019-01-21T17:03:48 r8n[16-23]</font></div><div><font face="monospace" size="3">image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root 2019-01-21T17:03:49 r8n[24-29]</font></div><div><font face="monospace" size="3">image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root 2019-01-21T17:03:49 r8n[30-31]</font></div><div><br></div><div>For as long as the allocated node (r8n32) has not been rebooted, the "reboot issued" message keeps appending to the reason for all other nodes, and the 
ResumeTimeout is ignored. Even worse: the other nodes get stuck in an endless reboot loop. It seems like they keep getting the instruction to reboot. As soon as I cancel the reboot for the allocated node, the reboot loop stops for all other nodes. </div><div><span style="font-family: monospace; font-size: medium;"><br></span></div><div>This also happens if we run the reboot command in a loop:</div><div><div><br></div><div><font face="monospace" size="3">root# for n in {1..32}; do scontrol reboot ASAP nextstate=resume reason=image r8n$n; done</font></div><div><br></div><div>So it seems that Slurm somehow groups all nodes that need to be rebooted together, and issues reboot commands to them until the last one of them is ready to reboot. This happens regardless of whether the scontrol command has been issued for all nodes at once or independently.</div><div><br></div><div>I should add that the command works fine if we need to reboot just one node, or for a couple of nodes that were already idle to begin with. The RebootProgram is /sbin/reboot, so nothing out of the ordinary.</div><div><font face="monospace" size="3"><br></font></div></div><div>Best regards,</div><pre>Martijn Kruiten</pre><pre>-- <br></pre><div>| System Programmer | SURFsara | Science Park 140 | 1098 XG Amsterdam |</div><div data-evo-signature-plain-text-mode="">| T +31 6 20043417 | <a href="mailto:bas.vandervlies@surfsara.nl">martijn.kruiten@surfsara.nl</a> | www.surfsara.nl |</div></span></div></body></html>