[slurm-users] After reboot nodes are in state = down
Rafał Kędziorski
rafal.kedziorski at gmail.com
Fri Sep 27 05:39:33 UTC 2019
Hi,
I'm working with slurm-wlm 18.08.5-2 on a Raspberry Pi cluster:
- 1 Pi 4 as manager
- 4 Pi 4 nodes
This works fine. But after every restart of the nodes I get this
cluster@pi-manager:~ $ sinfo
PARTITION    AVAIL  TIMELIMIT  NODES  STATE  NODELIST
devcluster*  up     infinite   4      down   pi-4-node-[1-4]
state. Then I can call
sudo scontrol update NodeName=<node_name> State=RESUME
for every node, and sometimes all nodes become idle and sometimes some stay down:
cluster@pi-manager:~ $ sinfo
PARTITION    AVAIL  TIMELIMIT  NODES  STATE  NODELIST
devcluster*  up     infinite   2      idle   pi-4-node-[1-2]
devcluster*  up     infinite   2      down   pi-4-node-[3-4]
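The per-node RESUME calls described above could be scripted roughly as follows. This is only a sketch: the helper name resume_cmds is made up for illustration, and it prints the commands instead of running them so they can be reviewed first. It assumes that sinfo -h -N -t down -o '%N' prints one down node per line.

```shell
#!/bin/sh
# Hypothetical helper: read node names (one per line) on stdin and
# print the matching "scontrol update" command for each of them.
resume_cmds() {
    # sort -u drops duplicates (a node can be listed once per partition).
    sort -u | while IFS= read -r node; do
        # Skip empty lines so no malformed command is printed.
        [ -n "$node" ] || continue
        printf 'scontrol update NodeName=%s State=RESUME\n' "$node"
    done
}

# Intended usage on the manager (assumption, not run automatically here):
#   sinfo -h -N -t down -o '%N' | resume_cmds            # review the commands
#   sinfo -h -N -t down -o '%N' | resume_cmds | sudo sh  # then execute them
```

As far as I know, scontrol also accepts a host list expression, so a single quoted call such as sudo scontrol update NodeName='pi-4-node-[1-4]' State=RESUME should have the same effect for all four nodes.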
Status of all nodes:
cluster@pi-manager:~ $ scontrol show nodes
NodeName=pi-4-node-1 Arch=armv7l CoresPerSocket=1
CPUAlloc=0 CPUTot=4 CPULoad=0.24
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=192.168.178.141 NodeHostName=pi-4-node-1 Version=18.08
OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
RealMemory=1 AllocMem=0 FreeMem=3687 Sockets=4 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=devcluster
BootTime=2019-09-19T17:38:58 SlurmdStartTime=2019-09-19T00:26:36
CfgTRES=cpu=4,mem=1M,billing=4
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
NodeName=pi-4-node-2 Arch=armv7l CoresPerSocket=1
CPUAlloc=0 CPUTot=4 CPULoad=0.06
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=192.168.178.142 NodeHostName=pi-4-node-2 Version=18.08
OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
RealMemory=1 AllocMem=0 FreeMem=3687 Sockets=4 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=devcluster
BootTime=2019-09-19T17:38:57 SlurmdStartTime=2019-09-19T00:26:49
CfgTRES=cpu=4,mem=1M,billing=4
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
NodeName=pi-4-node-3 Arch=armv7l CoresPerSocket=1
CPUAlloc=0 CPUTot=4 CPULoad=0.02
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=192.168.178.143 NodeHostName=pi-4-node-3 Version=18.08
OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
RealMemory=1 AllocMem=0 FreeMem=3676 Sockets=4 Boards=1
State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=devcluster
BootTime=2019-09-19T17:38:55 SlurmdStartTime=2019-09-19T00:26:45
CfgTRES=cpu=4,mem=1M,billing=4
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Node unexpectedly rebooted [slurm@2019-09-19T17:39:32]
NodeName=pi-4-node-4 Arch=armv7l CoresPerSocket=1
CPUAlloc=0 CPUTot=4 CPULoad=0.02
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=192.168.178.144 NodeHostName=pi-4-node-4 Version=18.08
OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
RealMemory=1 AllocMem=0 FreeMem=3687 Sockets=4 Boards=1
State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=devcluster
BootTime=2019-09-19T17:38:52 SlurmdStartTime=2019-09-19T00:26:47
CfgTRES=cpu=4,mem=1M,billing=4
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Node unexpectedly rebooted [slurm@2019-09-19T17:39:30]
NodeName=pi-manager Arch=armv7l CoresPerSocket=1
CPUAlloc=0 CPUTot=4 CPULoad=0.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=192.168.178.140 NodeHostName=pi-manager Version=18.08
OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
RealMemory=1 AllocMem=0 FreeMem=3446 Sockets=4 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
BootTime=2019-09-19T17:35:48 SlurmdStartTime=2019-09-19T08:10:51
CfgTRES=cpu=4,mem=1M,billing=4
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
For the nodes which are down, the Reason is:
Reason=Node unexpectedly rebooted [slurm@2019-09-19T17:39:30]
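To pull just the state and Reason out of the long listing above, something like the following might help. It is a rough sketch: node_summary is a made-up helper name, it expects one node record per line (as scontrol -o show nodes should produce), and it assumes that Reason= is the last field of a record, which may not hold for every Slurm version.

```shell
#!/bin/sh
# Hypothetical helper: summarise one-line node records
# (e.g. from "scontrol -o show nodes") as "name state reason".
node_summary() {
    awk '{
        name = ""; state = ""; reason = "-"
        for (i = 1; i <= NF; i++) {
            # "NodeName=" is 9 characters, "State=" is 6.
            if ($i ~ /^NodeName=/) name = substr($i, 10)
            if ($i ~ /^State=/)    state = substr($i, 7)
        }
        # The reason text contains spaces, so take the rest of the line
        # (assumption: Reason= is the last field of the record).
        p = index($0, "Reason=")
        if (p > 0) reason = substr($0, p + 7)
        if (name != "") print name, state, reason
    }'
}

# Intended usage on the manager (assumption, not run automatically here):
#   scontrol -o show nodes | node_summary
```

Nodes without a Reason field are shown with a "-" in the last column.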
What is the problem? Note that the nodes in my cluster are not running the whole time.
Regards,
Rafal