[slurm-users] Apparent scontrol reboot bug

Tue Jan 22 14:28:33 UTC 2019

Hello,

Yes, it's a bug in the way the reboot RPCs are handled.  A fix was recently
committed, which we have yet to test, but 18.08.5 is meant to repair this
(among other things).

Doug

On Tue, Jan 22, 2019 at 02:46 Martijn Kruiten <martijn.kruiten at surfsara.nl>
wrote:

> Hi,
>
> We are encountering a strange issue on our system (Slurm 18.08.3), and I'm
> curious whether any of you recognize this behavior. In the following
> example we try to reboot 32 nodes, of which 31 nodes are idle:
>
> root# scontrol reboot ASAP nextstate=resume reason=image r8n[1-32]
> root# sinfo -o "%100E %9u %19H %N"
>
> REASON                                                                                               USER      TIMESTAMP           NODELIST
> image                                                                                                root      2019-01-21T17:03:49 r8n32
> image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:47 r8n[1-3]
> image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:47 r8n[4-10]
> image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:48 r8n[11-15]
> image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:48 r8n[16-23]
> image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:49 r8n[24-29]
> image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:49 r8n[30-31]
>
> For as long as the allocated node (r8n32) has not been rebooted, the
> "reboot issued" message keeps appending to the reason for all other nodes,
> and the ResumeTimeout is ignored. Even worse: the other nodes get stuck in
> an endless reboot loop. It seems like they keep getting the instruction to
> reboot. As soon as I cancel the reboot for the allocated node, the reboot
> loop stops for all other nodes.
>
> This also happens if we do the reboot command in a loop:
>
> root# for n in {1..32}; do scontrol reboot ASAP nextstate=resume reason=image r8n$n; done
>
> So it seems that Slurm somehow groups all nodes that need to be rebooted
> together, and issues reboot commands to them until the last one of them is
> ready to reboot. This happens regardless of whether the scontrol command
> has been issued for all nodes at once or independently.
>
> I should add that the command works fine if we need to reboot just one
> node, or for a couple of nodes that were already idle to begin with. The
> RebootProgram is /sbin/reboot, so nothing out of the ordinary.
>
> Best regards,
>
> Martijn Kruiten
>
> --
>
> | System Programmer | SURFsara | Science Park 140 | 1098 XG Amsterdam |
> | T +31 6 20043417  | martijn.kruiten at surfsara.nl
> <bas.vandervlies at surfsara.nl> | www.surfsara.nl |
>
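
For anyone who wants to check whether their own nodes show the same
symptom, here is a minimal sketch that flags nodes whose reason string
contains "reboot issued" more than once, as in the sinfo output quoted
above. The function names and the "NODELIST|REASON" input layout are
assumptions made for illustration, not anything Slurm provides, and the
100-character reason width simply mirrors the format used in the original
post.

```shell
#!/bin/sh
# Count how many times "reboot issued" appears in one reason string.
# (Hypothetical helper; "|| true" keeps a zero-match grep from failing.)
count_reboot_issued() {
    printf '%s\n' "$1" | { grep -o 'reboot issued' || true; } | wc -l | tr -d ' '
}

# Read "NODELIST|REASON" lines on stdin and print "<nodes> <count>" for
# every entry whose reason contains the marker more than once.
flag_repeated_reboots() {
    while IFS='|' read -r nodes reason; do
        n=$(count_reboot_issued "$reason")
        if [ "$n" -gt 1 ]; then
            printf '%s %s\n' "$nodes" "$n"
        fi
    done
}

# Possible use on a live system (assumed invocation, not run here):
#   sinfo -h -o "%N|%100E" | flag_repeated_reboots
```

Because the reason is truncated to the requested width, the count is a
lower bound: a cut-off trailing "reboot issue" is not counted.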