[slurm-users] After reboot nodes are in state = down
Rafał Kędziorski
rafal.kedziorski at gmail.com
Fri Sep 27 08:36:15 UTC 2019
Hi Andreas,
my cluster is not running all the time. I just call sudo shutdown, and after
boot the nodes are in state down.
I'm using Slurm on a Raspberry Pi cluster (5x Pi 4). What is the best way to
shut down the nodes so that after boot they are idle and not down?
Regards,
Rafal
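
A minimal sketch of resuming every down node in one go after the cluster has
booted, instead of calling scontrol update once per node. The function names
down_nodelists and resume_down_nodes are made up for this example. It assumes
sinfo and scontrol are on the PATH and that the caller has Slurm administrator
rights (for example via sudo).

```shell
#!/bin/sh
# Sketch only: resume all nodes that sinfo currently reports as down.
# Function names are hypothetical, not part of Slurm.

# Read lines of the form "<state> <nodelist>" on stdin and print the
# node list expression of every entry whose state is "down" or "down*"
# ("down*" is how sinfo marks down nodes that are not responding).
down_nodelists() {
    awk '$1 == "down" || $1 == "down*" { print $2 }'
}

# Ask sinfo for "<state> <nodelist>" pairs without a header line and
# resume each down node list. scontrol accepts expressions such as
# pi-4-node-[3-4] as NodeName.
resume_down_nodes() {
    sinfo -h -o "%T %N" | down_nodelists | while read -r nodes; do
        scontrol update NodeName="$nodes" State=RESUME
    done
}
```

Calling resume_down_nodes on the manager once the nodes are up should have the
same effect as the per-node State=RESUME calls quoted below.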
On Fri, 27 Sep 2019 at 08:43, Henkel, Andreas <
henkel at uni-mainz.de> wrote:
> Hi Rafal,
>
> How do you restart the nodes? If you don't use scontrol reboot <node>,
> Slurm doesn't expect the nodes to reboot, which is why you see that reason
> in those cases.
>
> Best
> Andreas
>
> On 27.09.2019 at 07:53, Rafał Kędziorski <
> rafal.kedziorski at gmail.com> wrote:
>
> Hi,
>
> I'm working with slurm-wlm 18.08.5-2 on a Raspberry Pi cluster:
>
> - 1 Pi 4 as manager
> - 4 Pi 4 nodes
>
> This works fine. But after every restart of the nodes I get this
>
> cluster@pi-manager:~ $ sinfo
> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
> devcluster* up infinite 4 down pi-4-node-[1-4]
>
> state. Then I can call
>
> sudo scontrol update NodeName=<node_name> State=RESUME
>
> for every node. Sometimes all nodes are then idle, and sometimes some stay down:
>
> cluster@pi-manager:~ $ sinfo
> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
> devcluster* up infinite 2 idle pi-4-node-[1-2]
> devcluster* up infinite 2 down pi-4-node-[3-4]
>
> Status of all nodes:
>
> cluster@pi-manager:~ $ scontrol show nodes
> NodeName=pi-4-node-1 Arch=armv7l CoresPerSocket=1
> CPUAlloc=0 CPUTot=4 CPULoad=0.24
> AvailableFeatures=(null)
> ActiveFeatures=(null)
> Gres=(null)
> NodeAddr=192.168.178.141 NodeHostName=pi-4-node-1 Version=18.08
> OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
> RealMemory=1 AllocMem=0 FreeMem=3687 Sockets=4 Boards=1
> State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> Partitions=devcluster
> BootTime=2019-09-19T17:38:58 SlurmdStartTime=2019-09-19T00:26:36
> CfgTRES=cpu=4,mem=1M,billing=4
> AllocTRES=
> CapWatts=n/a
> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
>
> NodeName=pi-4-node-2 Arch=armv7l CoresPerSocket=1
> CPUAlloc=0 CPUTot=4 CPULoad=0.06
> AvailableFeatures=(null)
> ActiveFeatures=(null)
> Gres=(null)
> NodeAddr=192.168.178.142 NodeHostName=pi-4-node-2 Version=18.08
> OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
> RealMemory=1 AllocMem=0 FreeMem=3687 Sockets=4 Boards=1
> State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> Partitions=devcluster
> BootTime=2019-09-19T17:38:57 SlurmdStartTime=2019-09-19T00:26:49
> CfgTRES=cpu=4,mem=1M,billing=4
> AllocTRES=
> CapWatts=n/a
> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
>
> NodeName=pi-4-node-3 Arch=armv7l CoresPerSocket=1
> CPUAlloc=0 CPUTot=4 CPULoad=0.02
> AvailableFeatures=(null)
> ActiveFeatures=(null)
> Gres=(null)
> NodeAddr=192.168.178.143 NodeHostName=pi-4-node-3 Version=18.08
> OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
> RealMemory=1 AllocMem=0 FreeMem=3676 Sockets=4 Boards=1
> State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> Partitions=devcluster
> BootTime=2019-09-19T17:38:55 SlurmdStartTime=2019-09-19T00:26:45
> CfgTRES=cpu=4,mem=1M,billing=4
> AllocTRES=
> CapWatts=n/a
> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> Reason=Node unexpectedly rebooted [slurm@2019-09-19T17:39:32]
>
> NodeName=pi-4-node-4 Arch=armv7l CoresPerSocket=1
> CPUAlloc=0 CPUTot=4 CPULoad=0.02
> AvailableFeatures=(null)
> ActiveFeatures=(null)
> Gres=(null)
> NodeAddr=192.168.178.144 NodeHostName=pi-4-node-4 Version=18.08
> OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
> RealMemory=1 AllocMem=0 FreeMem=3687 Sockets=4 Boards=1
> State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> Partitions=devcluster
> BootTime=2019-09-19T17:38:52 SlurmdStartTime=2019-09-19T00:26:47
> CfgTRES=cpu=4,mem=1M,billing=4
> AllocTRES=
> CapWatts=n/a
> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> Reason=Node unexpectedly rebooted [slurm@2019-09-19T17:39:30]
>
> NodeName=pi-manager Arch=armv7l CoresPerSocket=1
> CPUAlloc=0 CPUTot=4 CPULoad=0.00
> AvailableFeatures=(null)
> ActiveFeatures=(null)
> Gres=(null)
> NodeAddr=192.168.178.140 NodeHostName=pi-manager Version=18.08
> OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
> RealMemory=1 AllocMem=0 FreeMem=3446 Sockets=4 Boards=1
> State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> BootTime=2019-09-19T17:35:48 SlurmdStartTime=2019-09-19T08:10:51
> CfgTRES=cpu=4,mem=1M,billing=4
> AllocTRES=
> CapWatts=n/a
> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> For the nodes which are down, the Reason is:
>
> Reason=Node unexpectedly rebooted [slurm@2019-09-19T17:39:30]
>
> What is the problem? Note that the nodes in my cluster are not running all
> the time.
>
>
>
> Regards,
> Rafal
>
>
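
As the reply above says, Slurm sets the "Node unexpectedly rebooted" reason
when a node comes back from a reboot that was not requested through Slurm. One
setting that is often used for clusters which are powered off regularly is
ReturnToService=2 in slurm.conf. According to the slurm.conf man page, it lets
a DOWN node become available again once slurmd registers with a valid
configuration. Below is a small sketch for checking the current value. The
helper name return_to_service_value is hypothetical, and the sketch assumes
that scontrol show config prints lines of the form "Key = value".

```shell
#!/bin/sh
# Sketch only: extract the ReturnToService value from the output of
# "scontrol show config" read on stdin. The function name is made up
# for this example and is not part of Slurm.
return_to_service_value() {
    awk -F'=' '$1 ~ /^ReturnToService[ \t]*$/ {
        gsub(/[ \t]/, "", $2)   # strip padding around the value
        print $2
    }'
}

# Intended use on the manager (not run here):
#   scontrol show config | return_to_service_value
#
# Assumed slurm.conf line for nodes that should return to service
# after any reboot:
#   ReturnToService=2
```

If the helper prints 0 or 1, setting ReturnToService=2 in slurm.conf on all
machines and then restarting the Slurm daemons (or running scontrol
reconfigure) may be enough for the nodes to come back idle after a plain
sudo shutdown.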