[slurm-users] After reboot nodes are in state = down

Rafał Kędziorski rafal.kedziorski at gmail.com
Fri Sep 27 05:39:33 UTC 2019


Hi,

I'm working with slurm-wlm 18.08.5-2 on a Raspberry Pi cluster:

- 1 Pi 4 as manager
- 4 Pi 4 nodes

This works fine. But after every restart of the nodes I get this

cluster@pi-manager:~ $ sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
devcluster*    up   infinite      4   down pi-4-node-[1-4]

state. Then I can call

sudo scontrol update NodeName=<node_name> State=RESUME

for every node, and afterwards sometimes all nodes are idle and sometimes some are still down:

cluster@pi-manager:~ $ sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
devcluster*    up   infinite      2   idle pi-4-node-[1-2]
devcluster*    up   infinite      2   down pi-4-node-[3-4]
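Resuming each node by hand could be replaced by one call that collects all down nodes first. The following is only a minimal sketch, assuming it is run on the manager with sudo rights; the function names join_nodes and resume_down_nodes are made up for illustration and are not part of Slurm.

```shell
#!/bin/sh
# Hypothetical helper: turn node names (one per line on stdin)
# into a sorted, de-duplicated, comma-separated list.
join_nodes() {
    sort -u | paste -sd, -
}

# Hypothetical helper: resume every node that sinfo reports as down.
resume_down_nodes() {
    # -h: no header, -N: one line per node,
    # -t down: only nodes in a down state, -o '%N': print the node name only
    nodes=$(sinfo -h -N -t down -o '%N' | join_nodes)
    if [ -n "$nodes" ]; then
        # scontrol accepts a comma-separated node list in NodeName
        sudo scontrol update NodeName="$nodes" State=RESUME
    else
        echo "no down nodes"
    fi
}
```

After sourcing the file, calling resume_down_nodes would issue a single scontrol update for all down nodes. It only automates the manual step above and does not explain why the nodes go down in the first place.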

Status of all nodes:

cluster@pi-manager:~ $ scontrol show nodes
NodeName=pi-4-node-1 Arch=armv7l CoresPerSocket=1
   CPUAlloc=0 CPUTot=4 CPULoad=0.24
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.178.141 NodeHostName=pi-4-node-1 Version=18.08
   OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
   RealMemory=1 AllocMem=0 FreeMem=3687 Sockets=4 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=devcluster
   BootTime=2019-09-19T17:38:58 SlurmdStartTime=2019-09-19T00:26:36
   CfgTRES=cpu=4,mem=1M,billing=4
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


NodeName=pi-4-node-2 Arch=armv7l CoresPerSocket=1
   CPUAlloc=0 CPUTot=4 CPULoad=0.06
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.178.142 NodeHostName=pi-4-node-2 Version=18.08
   OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
   RealMemory=1 AllocMem=0 FreeMem=3687 Sockets=4 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=devcluster
   BootTime=2019-09-19T17:38:57 SlurmdStartTime=2019-09-19T00:26:49
   CfgTRES=cpu=4,mem=1M,billing=4
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


NodeName=pi-4-node-3 Arch=armv7l CoresPerSocket=1
   CPUAlloc=0 CPUTot=4 CPULoad=0.02
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.178.143 NodeHostName=pi-4-node-3 Version=18.08
   OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
   RealMemory=1 AllocMem=0 FreeMem=3676 Sockets=4 Boards=1
   State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=devcluster
   BootTime=2019-09-19T17:38:55 SlurmdStartTime=2019-09-19T00:26:45
   CfgTRES=cpu=4,mem=1M,billing=4
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Node unexpectedly rebooted [slurm@2019-09-19T17:39:32]

NodeName=pi-4-node-4 Arch=armv7l CoresPerSocket=1
   CPUAlloc=0 CPUTot=4 CPULoad=0.02
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.178.144 NodeHostName=pi-4-node-4 Version=18.08
   OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
   RealMemory=1 AllocMem=0 FreeMem=3687 Sockets=4 Boards=1
   State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=devcluster
   BootTime=2019-09-19T17:38:52 SlurmdStartTime=2019-09-19T00:26:47
   CfgTRES=cpu=4,mem=1M,billing=4
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Node unexpectedly rebooted [slurm@2019-09-19T17:39:30]

NodeName=pi-manager Arch=armv7l CoresPerSocket=1
   CPUAlloc=0 CPUTot=4 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.178.140 NodeHostName=pi-manager Version=18.08
   OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
   RealMemory=1 AllocMem=0 FreeMem=3446 Sockets=4 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   BootTime=2019-09-19T17:35:48 SlurmdStartTime=2019-09-19T08:10:51
   CfgTRES=cpu=4,mem=1M,billing=4
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

For the nodes which are down, the Reason is:

Reason=Node unexpectedly rebooted [slurm@2019-09-19T17:39:30]
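To see the Reason of every down node at a glance, the scontrol output can be filtered. This is a rough sketch, assuming the output layout shown above (a NodeName= line first, then indented State= and Reason= lines); the function name down_reasons is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read `scontrol show nodes` output on stdin and
# print "<node>: <reason>" for every node whose State starts with DOWN.
down_reasons() {
    awk '
        # A new node record starts: remember its name, reset the down flag
        /^NodeName=/ { split($1, kv, "="); node = kv[2]; down = 0 }
        # Matches DOWN as well as variants such as DOWN* or DOWN+DRAIN
        /State=DOWN/ { down = 1 }
        # Print the reason text only for nodes flagged as down
        /^[ \t]*Reason=/ && down {
            sub(/^[ \t]*Reason=/, "")
            print node ": " $0
        }
    '
}
```

It would be used as `scontrol show nodes | down_reasons`. A similar overview should also be available directly from `sinfo -R`, which lists the reasons for nodes that are down or drained.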

What is the problem? Note that the nodes in my cluster are not running the whole time.



Regards,
Rafal