[slurm-users] scontrol reboot issue

Marcus Wagner wagner at itc.rwth-aachen.de
Mon Apr 29 08:50:24 UTC 2019


Dear all,


we use "scontrol reboot asap reason=<whatever reason it is> 
nextstate=resume" to e.g. do a reboot after a kernel update.

But I must say, this only works sometimes. Often Slurm forgets that a 
reboot is pending for a node and therefore never reboots it:


ncg01            DRAINING        Kernel-Update [root@2019-04-26T08:51:43]
ncg02            DRAINED         Kernel-Update [root@2019-04-26T08:51:43]
ncg04            DRAINED         Kernel-Update [root@2019-04-26T08:51:43]
ncg07            DRAINING@       Kernel-Update [root@2019-04-26T08:51:43]
ncg08            DRAINING        Kernel-Update [root@2019-04-26T08:51:43]
ncg10            DRAINED         Kernel-Update [root@2019-04-26T08:51:43]

As you can see, ncg07 is still draining; the "@" means the reboot is 
still pending.
ncg01 and ncg08 are also still draining, but Slurm has forgotten about 
the pending reboot (no "@" sign).
ncg02, ncg04 and ncg10 are already drained, but do not get rebooted.

ncg03 (not shown here) was drained and rebooted as expected.
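As far as I understand it, a queued reboot should also be visible as a 
flag on the node state (something like +REBOOT, though the exact flag 
name may differ between Slurm versions), so one can cross-check per 
node with e.g.:

   scontrol show node ncg01 | grep -E 'State=|Reason='

If the request were still queued, I would expect the State line to 
carry that reboot flag; if it is gone, that would match the missing 
"@" in the listing above.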

The following is a small excerpt from the slurmctld log:

[2019-04-26T08:51:43.267] reboot request queued for nodes ncg04
[2019-04-26T08:51:43.267] reboot request queued for nodes ncg02
[2019-04-26T08:51:43.267] reboot request queued for nodes ncg03
[2019-04-26T08:51:43.267] reboot request queued for nodes ncg10
[2019-04-26T08:51:43.267] reboot request queued for nodes ncg07
[2019-04-26T08:51:43.271] reboot request queued for nodes ncg08
[2019-04-26T08:51:43.271] reboot request queued for nodes ncg01
[2019-04-26T09:21:31.053] node ncg03 returned to service
[2019-04-26T11:53:15.937] Node ncg08 now responding
[2019-04-26T11:53:15.937] Node ncg02 now responding
[2019-04-26T11:56:51.565] Node ncg01 now responding
[2019-04-26T11:56:51.565] Node ncg10 now responding
[2019-04-26T11:56:51.565] Node ncg08 now responding
[2019-04-26T11:56:51.565] Node ncg04 now responding
[2019-04-26T11:56:51.565] Node ncg03 now responding
[2019-04-29T09:44:17.839] node ncg10 returned to service
[2019-04-29T09:44:21.102] node ncg02 returned to service
[2019-04-29T09:44:32.394] node ncg04 returned to service
[2019-04-29T10:20:22.557] update_node: node ncg02 state set to IDLE
[2019-04-29T10:20:32.553] update_node: node ncg04 state set to IDLE
[2019-04-29T10:20:51.897] update_node: node ncg10 state set to IDLE



Today at about 09:40 I reissued the reboot for ncg02, ncg04 and ncg10. 
This time, since these nodes were already drained, slurmctld issued the 
reboot immediately and the nodes are now up again.
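Concretely, the command I reissued was along these lines (same reason 
string as before):

   scontrol reboot asap reason="Kernel-Update" nextstate=resume ncg02,ncg04,ncg10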


Does anyone have similar issues, or a clue where this behaviour might 
come from?


Best
Marcus


-- 
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de



