[slurm-users] Apparent scontrol reboot bug

Bas van der Vlies bas.vandervlies at surfsara.nl
Tue Jan 22 16:10:34 UTC 2019


Thanks for the update. We are going to build a new package and test it.
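
Before re-running the reboot test, it may help to confirm which Slurm version is actually installed. The following is only a sketch: the helper name version_ge is made up for illustration, it relies on GNU sort -V, and 18.08.5 is simply the release mentioned below as intended to carry the fix, which nobody in this thread has verified yet.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if version $1 is at least version $2.
# Relies on GNU coreutils `sort -V` (version sort).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# 18.08.3 is the affected version from the report; 18.08.5 is the
# release said to contain the fix (untested in this thread).
for v in 18.08.3 18.08.5; do
  if version_ge "$v" 18.08.5; then
    echo "$v: should include the reboot fix"
  else
    echo "$v: predates the reboot fix"
  fi
done
```

On a live system the version string would presumably come from `scontrol --version` rather than being hard-coded as it is here.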

On 22/01/2019 15:30, Douglas Jacobsen wrote:
> There were several related commits last week:
> https://github.com/SchedMD/slurm/commits/slurm-18.08
> 
> 
> On Tue, Jan 22, 2019 at 06:28 Douglas Jacobsen <dmjacobsen at lbl.gov 
> <mailto:dmjacobsen at lbl.gov>> wrote:
> 
>     Hello,
> 
>     Yes, it's a bug in the way the reboot RPCs are handled.  A fix was
>     recently committed, which we have yet to test, but 18.08.5 is meant
>     to repair this (among other things).
> 
>     Doug
> 
>     On Tue, Jan 22, 2019 at 02:46 Martijn Kruiten
>     <martijn.kruiten at surfsara.nl <mailto:martijn.kruiten at surfsara.nl>>
>     wrote:
> 
>         Hi,
> 
>         We have encountered a strange issue on our system (Slurm 18.08.3),
>         and I'm curious whether any of you recognize this behavior. In
>         the following example we try to reboot 32 nodes, of which 31
>         are idle:
> 
>         root# scontrol reboot ASAP nextstate=resume reason=image r8n[1-32]
>         root# sinfo -o "%100E %9u %19H %N"
>         REASON                                                                                               USER      TIMESTAMP           NODELIST
>         image                                                                                                root      2019-01-21T17:03:49 r8n32
>         image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:47 r8n[1-3]
>         image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:47 r8n[4-10]
>         image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:48 r8n[11-15]
>         image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:48 r8n[16-23]
>         image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:49 r8n[24-29]
>         image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:49 r8n[30-31]
> 
>         For as long as the allocated node (r8n32) has not been rebooted,
>         the "reboot issued" message keeps appending to the reason for
>         all other nodes, and the ResumeTimeout is ignored. Even worse:
>         the other nodes get stuck in an endless reboot loop. It seems
>         like they keep getting the instruction to reboot. As soon as I
>         cancel the reboot for the allocated node, the reboot loop stops
>         for all other nodes.
> 
>         This also happens if we do the reboot command in a loop:
> 
>         root# for n in {1..32}; do scontrol reboot ASAP nextstate=resume reason=image r8n$n; done
> 
>         So it seems that Slurm somehow groups all nodes that need to be
>         rebooted together, and issues reboot commands to them until the
>         last one of them is ready to reboot. This happens regardless of
>         whether the scontrol command has been issued for all nodes at
>         once or independently.
> 
>         I should add that the command works fine if we need to reboot
>         just one node, or a couple of nodes that were already idle to
>         begin with. The RebootProgram is /sbin/reboot, so nothing out of
>         the ordinary.
> 
>         Best regards,
> 
>         Martijn Kruiten
> 
>         -- 
> 
>         | System Programmer | SURFsara | Science Park 140 | 1098
>         XG Amsterdam |
>         | T +31 6 20043417  | martijn.kruiten at surfsara.nl
>         <mailto:bas.vandervlies at surfsara.nl> | www.surfsara.nl
>         <http://www.surfsara.nl> |
> 

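When testing the new package, the symptom from the report quoted above (the reason field piling up "reboot issued") can be checked for mechanically. This is a rough sketch, not a confirmed procedure: the function name looping_nodes and the "node|reason" input layout are assumptions made for illustration, and the sample lines are shortened versions of the output in the report.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "node|reason" lines on stdin and print each
# node whose reason contains "reboot issued" more than once, followed
# by the number of occurrences.
looping_nodes() {
  awk -F'|' '{
    n = gsub(/reboot issued/, "&", $2)   # count matches; text is unchanged
    if (n > 1) print $1, n
  }'
}

# Sample input modelled on the report (reasons shortened).
sample='r8n1|image : reboot issued : reboot issued : reboot issue
r8n32|image'
printf '%s\n' "$sample" | looping_nodes

# On a live cluster the input would presumably come from something like:
#   sinfo -h -N -o "%N|%E" | looping_nodes
# and, per the report, cancelling the reboot of the still-allocated node
# stopped the loop, e.g.:
#   scontrol cancel_reboot r8n32
```

A node that shows up here with a growing count between two runs would match the endless-reboot behaviour described in the report; a count that stays at one would suggest the fix works.
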
-- 
Bas van der Vlies
| Operations, Support & Development | SURFsara | Science Park 140 | 1098 
XG  Amsterdam
| T +31 (0) 20 800 1300  | bas.vandervlies at surfsara.nl | www.surfsara.nl |
