[slurm-users] scontrol reboot issue
Marcus Wagner
wagner at itc.rwth-aachen.de
Mon Apr 29 08:50:24 UTC 2019
Dear all,
we use "scontrol reboot asap reason=<whatever reason it is>
nextstate=resume" to e.g. do a reboot after a kernel update.
But I must say, this only works SOMETIMES. Often Slurm forgets that there
is a pending maintenance reboot for a node and therefore does not reboot it:
ncg01 DRAINING Kernel-Update [root at 2019-04-26T08:51:43]
ncg02 DRAINED Kernel-Update [root at 2019-04-26T08:51:43]
ncg04 DRAINED Kernel-Update [root at 2019-04-26T08:51:43]
ncg07 DRAINING@ Kernel-Update [root at 2019-04-26T08:51:43]
ncg08 DRAINING Kernel-Update [root at 2019-04-26T08:51:43]
ncg10 DRAINED Kernel-Update [root at 2019-04-26T08:51:43]
As you can see, ncg07 is still draining; the "@" means the reboot is
still pending.
ncg01 and ncg08 are also still draining, but Slurm has forgotten about
the pending reboot (no "@" sign).
ncg02, ncg04 and ncg10 are already drained, but do not get rebooted.
ncg03 (not shown here) got drained and rebooted as expected.
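The state lines above are sinfo output; to double-check a single node I
also look at scontrol show node. A rough sketch (the sinfo format string
is only an approximation of the one we actually use):

    # node-oriented listing of state and drain reason for the affected nodes
    sinfo -N -n ncg01,ncg02,ncg04,ncg07,ncg08,ncg10 -o "%N %T %E"
    # per-node detail view (State= and Reason= fields)
    scontrol show node ncg07 | grep -E 'State|Reason'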
The following is a small excerpt from the slurmctld log:
[2019-04-26T08:51:43.267] reboot request queued for nodes ncg04
[2019-04-26T08:51:43.267] reboot request queued for nodes ncg02
[2019-04-26T08:51:43.267] reboot request queued for nodes ncg03
[2019-04-26T08:51:43.267] reboot request queued for nodes ncg10
[2019-04-26T08:51:43.267] reboot request queued for nodes ncg07
[2019-04-26T08:51:43.271] reboot request queued for nodes ncg08
[2019-04-26T08:51:43.271] reboot request queued for nodes ncg01
[2019-04-26T09:21:31.053] node ncg03 returned to service
[2019-04-26T11:53:15.937] Node ncg08 now responding
[2019-04-26T11:53:15.937] Node ncg02 now responding
[2019-04-26T11:56:51.565] Node ncg01 now responding
[2019-04-26T11:56:51.565] Node ncg10 now responding
[2019-04-26T11:56:51.565] Node ncg08 now responding
[2019-04-26T11:56:51.565] Node ncg04 now responding
[2019-04-26T11:56:51.565] Node ncg03 now responding
[2019-04-29T09:44:17.839] node ncg10 returned to service
[2019-04-29T09:44:21.102] node ncg02 returned to service
[2019-04-29T09:44:32.394] node ncg04 returned to service
[2019-04-29T10:20:22.557] update_node: node ncg02 state set to IDLE
[2019-04-29T10:20:32.553] update_node: node ncg04 state set to IDLE
[2019-04-29T10:20:51.897] update_node: node ncg10 state set to IDLE
Today at about 09:40 I reissued the reboot for ncg02, ncg04 and ncg10.
This time, since these nodes were already drained, slurmctld issued the
reboot right away and the nodes are now up again.
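The re-issue was just the same command again, roughly:

    scontrol reboot asap reason="Kernel-Update" nextstate=resume ncg02,ncg04,ncg10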
Does anyone have similar issues, or a clue where this behaviour might
come from?
Best
Marcus
--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de