[slurm-users] Apparent scontrol reboot bug
Douglas Jacobsen
dmjacobsen at lbl.gov
Tue Jan 22 14:30:49 UTC 2019
There were several related commits last week:
https://github.com/SchedMD/slurm/commits/slurm-18.08
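Until a fixed release is deployed, the workaround Martijn describes below (cancelling the pending reboot on the busy node) can be sketched roughly as follows. This is only an illustration, not an official procedure: it assumes scontrol's cancel_reboot subcommand (present in 18.08) is available, and the DRY_RUN wrapper is a hypothetical helper so the sketch only prints the commands instead of running them.

```shell
# Sketch of the workaround described below; NOT an official procedure.
# Assumes scontrol's cancel_reboot subcommand (18.08+) is available.
# DRY_RUN=1 makes run() print each command instead of executing it,
# so this sketch is safe to run outside a Slurm cluster.
DRY_RUN=1
run() { if [ "${DRY_RUN:-0}" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# Cancel the pending ASAP reboot on the busy node (r8n32 in the report
# below), which reportedly stops the reboot loop on the remaining nodes:
run scontrol cancel_reboot r8n32

# Once r8n32 has drained, its reboot can be re-issued separately:
run scontrol reboot ASAP nextstate=resume reason=image r8n32
```

With DRY_RUN unset (or 0), run() executes the commands verbatim against the cluster.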
On Tue, Jan 22, 2019 at 06:28 Douglas Jacobsen <dmjacobsen at lbl.gov> wrote:
> Hello,
>
> Yes, it's a bug in the way the reboot RPCs are handled. A fix was recently
> committed which we have yet to test, but 18.08.5 is meant to repair this
> (among other things).
>
> Doug
>
> On Tue, Jan 22, 2019 at 02:46 Martijn Kruiten <martijn.kruiten at surfsara.nl>
> wrote:
>
>> Hi,
>>
>> We encounter a strange issue on our system (Slurm 18.08.3), and I'm
>> curious whether anyone of you recognizes this behavior. In the following
>> example we try to reboot 32 nodes, of which 31 nodes are idle:
>>
>> root# scontrol reboot ASAP nextstate=resume reason=image r8n[1-32]
>> root# sinfo -o "%100E %9u %19H %N"
>>
>> REASON USER TIMESTAMP NODELIST
>> image root 2019-01-21T17:03:49 r8n32
>> image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root 2019-01-21T17:03:47 r8n[1-3]
>> image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root 2019-01-21T17:03:47 r8n[4-10]
>> image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root 2019-01-21T17:03:48 r8n[11-15]
>> image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root 2019-01-21T17:03:48 r8n[16-23]
>> image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root 2019-01-21T17:03:49 r8n[24-29]
>> image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root 2019-01-21T17:03:49 r8n[30-31]
>>
>> For as long as the allocated node (r8n32) has not been rebooted, the
>> "reboot issued" message keeps being appended to the reason of all other
>> nodes, and the ResumeTimeout is ignored. Even worse: the other nodes get
>> stuck in an endless reboot loop; they seem to keep receiving the
>> instruction to reboot. As soon as I cancel the reboot for the allocated
>> node, the reboot loop stops for all of the other nodes.
>>
>> This also happens if we do the reboot command in a loop:
>>
>> root# for n in {1..32}; do scontrol reboot ASAP nextstate=resume
>> reason=image r8n$n; done
>>
>> So it seems that Slurm somehow groups all nodes that need to be rebooted
>> together, and issues reboot commands to them until the last one of them is
>> ready to reboot. This happens regardless of whether the scontrol command
>> has been issued for all nodes at once or independently.
>>
>> I should add that the command works fine if we need to reboot just one
>> node, or a couple of nodes that were already idle to begin with. The
>> RebootProgram is /sbin/reboot, so nothing out of the ordinary.
>>
>> Best regards,
>>
>> Martijn Kruiten
>>
>> --
>>
>> | System Programmer | SURFsara | Science Park 140 | 1098 XG Amsterdam |
>> | T +31 6 20043417 | martijn.kruiten at surfsara.nl | www.surfsara.nl |
>>
--
Sent from Gmail Mobile