[slurm-users] Apparent scontrol reboot bug
Bas van der Vlies
bas.vandervlies at surfsara.nl
Tue Jan 22 16:10:34 UTC 2019
Thanks for the update. We're going to try to build a new package and test it.
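For reference, a typical way to build such a package from a release
tarball on an RPM-based system is rpmbuild's tarball mode, as described
in Slurm's quick-start admin guide (a sketch; the exact file name is
illustrative):

root# rpmbuild -ta slurm-18.08.5.tar.bz2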
On 22/01/2019 15:30, Douglas Jacobsen wrote:
> There were several related commits last week:
> https://github.com/SchedMD/slurm/commits/slurm-18.08
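>
> For anyone who wants to browse those commits locally rather than on
> GitHub, a quick sketch (the branch name is taken from the URL above):
>
>     git clone https://github.com/SchedMD/slurm.git
>     cd slurm && git log --oneline origin/slurm-18.08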
>
>
> On Tue, Jan 22, 2019 at 06:28 Douglas Jacobsen
> <dmjacobsen at lbl.gov> wrote:
>
> Hello,
>
> Yes, it's a bug in the way the reboot RPCs are handled. A fix was
> recently committed which we have yet to test, but 18.08.5 is meant
> to repair this (among other things).
>
> Doug
>
> On Tue, Jan 22, 2019 at 02:46 Martijn Kruiten
> <martijn.kruiten at surfsara.nl> wrote:
>
> Hi,
>
We encounter a strange issue on our system (Slurm 18.08.3), and
I'm curious whether any of you recognize this behavior. In the
following example we try to reboot 32 nodes, of which 31 are idle:
>
> root# scontrol reboot ASAP nextstate=resume reason=image r8n[1-32]
> root# sinfo -o "%100E %9u %19H %N"
REASON                                                                                               USER      TIMESTAMP           NODELIST
image                                                                                                root      2019-01-21T17:03:49 r8n32
image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:47 r8n[1-3]
image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:47 r8n[4-10]
image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:48 r8n[11-15]
image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:48 r8n[16-23]
image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:49 r8n[24-29]
image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:49 r8n[30-31]
>
For as long as the allocated node (r8n32) has not been rebooted,
the "reboot issued" message keeps being appended to the reason of
all other nodes, and the ResumeTimeout is ignored. Even worse, the
other nodes get stuck in an endless reboot loop: it seems they keep
receiving the instruction to reboot. As soon as I cancel the reboot
for the allocated node (see the command below), the loop stops for
all other nodes.
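
To be explicit, cancelling a pending reboot is done with scontrol's
cancel_reboot subcommand (a sketch, using the allocated node from the
example above):

root# scontrol cancel_reboot r8n32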
>
This also happens if we issue the reboot command in a loop, one
node at a time:
>
> root# for n in {1..32}; do scontrol reboot ASAP nextstate=resume
> reason=image r8n$n; done
>
So it seems that Slurm somehow groups together all nodes that need
to be rebooted, and keeps issuing reboot commands to them until the
last one is ready to reboot. This happens regardless of whether the
scontrol command was issued for all nodes at once or for each node
independently.
>
I should add that the command works fine if we need to reboot just
one node, or for a couple of nodes that were already idle to begin
with. The RebootProgram is /sbin/reboot, so nothing out of the
ordinary (see the excerpt below).
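
For completeness, a minimal slurm.conf excerpt for such a setup (only
the RebootProgram line is taken from the description above; the
ResumeTimeout value is illustrative):

RebootProgram=/sbin/reboot
ResumeTimeout=300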
>
> Best regards,
>
> Martijn Kruiten
>
> --
>
| System Programmer | SURFsara | Science Park 140 | 1098 XG Amsterdam |
| T +31 6 20043417 | martijn.kruiten at surfsara.nl | www.surfsara.nl |
>
> --
> Sent from Gmail Mobile
--
Bas van der Vlies
| Operations, Support & Development | SURFsara | Science Park 140 | 1098 XG Amsterdam |
| T +31 (0) 20 800 1300 | bas.vandervlies at surfsara.nl | www.surfsara.nl |