<div dir="ltr">Hi,<div><br></div><div>I'm working with slurm-wlm 18.08.5-2 on a Raspberry Pi cluster:</div><div><br></div><div>- 1 Pi 4 as manager</div><div>- 4 Pi 4 nodes</div><div><br></div><div>This works fine, but after every restart of the nodes they all end up in the down state:</div><div><br></div><div><font face="monospace">cluster@pi-manager:~ $ sinfo<br>PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST<br>devcluster*    up   infinite      4   down pi-4-node-[1-4]</font><br></div><div><br></div><div>I can then run</div><div><br></div><div>sudo scontrol update NodeName=&lt;node_name&gt; State=RESUME<br><br></div><div>for every node. Sometimes all nodes then become idle, and sometimes some of them stay down:</div><div><br></div><font face="monospace">cluster@pi-manager:~ $ sinfo<br>PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST<br>devcluster*    up   infinite      2   idle pi-4-node-[1-2]<br>devcluster*    up   infinite      2   down pi-4-node-[3-4]</font><br><div><br></div><div>Status of all nodes:</div><div><br></div><div><font face="monospace">cluster@pi-manager:~ $ scontrol show nodes<br>NodeName=pi-4-node-1 Arch=armv7l CoresPerSocket=1<br>   CPUAlloc=0 CPUTot=4 CPULoad=0.24<br>   AvailableFeatures=(null)<br>   ActiveFeatures=(null)<br>   Gres=(null)<br>   NodeAddr=192.168.178.141 NodeHostName=pi-4-node-1 Version=18.08<br>   OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019<br>   RealMemory=1 AllocMem=0 FreeMem=3687 Sockets=4 Boards=1<br>   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A<br>   Partitions=devcluster<br>   BootTime=2019-09-19T17:38:58 SlurmdStartTime=2019-09-19T00:26:36<br>   CfgTRES=cpu=4,mem=1M,billing=4<br>   AllocTRES=<br>   CapWatts=n/a<br>   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0<br>   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s<br><br><br>NodeName=pi-4-node-2 Arch=armv7l CoresPerSocket=1<br>   CPUAlloc=0 CPUTot=4 CPULoad=0.06<br>   AvailableFeatures=(null)<br>   ActiveFeatures=(null)<br>   Gres=(null)<br>   
NodeAddr=192.168.178.142 NodeHostName=pi-4-node-2 Version=18.08<br>   OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019<br>   RealMemory=1 AllocMem=0 FreeMem=3687 Sockets=4 Boards=1<br>   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A<br>   Partitions=devcluster<br>   BootTime=2019-09-19T17:38:57 SlurmdStartTime=2019-09-19T00:26:49<br>   CfgTRES=cpu=4,mem=1M,billing=4<br>   AllocTRES=<br>   CapWatts=n/a<br>   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0<br>   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s<br><br><br>NodeName=pi-4-node-3 Arch=armv7l CoresPerSocket=1<br>   CPUAlloc=0 CPUTot=4 CPULoad=0.02<br>   AvailableFeatures=(null)<br>   ActiveFeatures=(null)<br>   Gres=(null)<br>   NodeAddr=192.168.178.143 NodeHostName=pi-4-node-3 Version=18.08<br>   OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019<br>   RealMemory=1 AllocMem=0 FreeMem=3676 Sockets=4 Boards=1<br>   State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A<br>   Partitions=devcluster<br>   BootTime=2019-09-19T17:38:55 SlurmdStartTime=2019-09-19T00:26:45<br>   CfgTRES=cpu=4,mem=1M,billing=4<br>   AllocTRES=<br>   CapWatts=n/a<br>   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0<br>   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s<br>   Reason=Node unexpectedly rebooted [slurm@2019-09-19T17:39:32]<br><br>NodeName=pi-4-node-4 Arch=armv7l CoresPerSocket=1<br>   CPUAlloc=0 CPUTot=4 CPULoad=0.02<br>   AvailableFeatures=(null)<br>   ActiveFeatures=(null)<br>   Gres=(null)<br>   NodeAddr=192.168.178.144 NodeHostName=pi-4-node-4 Version=18.08<br>   OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019<br>   RealMemory=1 AllocMem=0 FreeMem=3687 Sockets=4 Boards=1<br>   State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A<br>   Partitions=devcluster<br>   BootTime=2019-09-19T17:38:52 SlurmdStartTime=2019-09-19T00:26:47<br>   CfgTRES=cpu=4,mem=1M,billing=4<br>   AllocTRES=<br>   CapWatts=n/a<br>   
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0<br>   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s<br>   Reason=Node unexpectedly rebooted [slurm@2019-09-19T17:39:30]<br><br>NodeName=pi-manager Arch=armv7l CoresPerSocket=1<br>   CPUAlloc=0 CPUTot=4 CPULoad=0.00<br>   AvailableFeatures=(null)<br>   ActiveFeatures=(null)<br>   Gres=(null)<br>   NodeAddr=192.168.178.140 NodeHostName=pi-manager Version=18.08<br>   OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019<br>   RealMemory=1 AllocMem=0 FreeMem=3446 Sockets=4 Boards=1<br>   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A<br>   BootTime=2019-09-19T17:35:48 SlurmdStartTime=2019-09-19T08:10:51<br>   CfgTRES=cpu=4,mem=1M,billing=4<br>   AllocTRES=<br>   CapWatts=n/a<br>   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0<br>   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s<br></font></div><div><br></div><div>For the nodes that are down, the reason given is:</div><div><br></div><div>Reason=Node unexpectedly rebooted [slurm@2019-09-19T17:39:30]<br></div><div><br></div><div>What is the problem? Note that the nodes in my cluster are not running the whole time.</div><div><br></div><div><br></div><div><br></div><div>Regards,</div><div>Rafal</div></div>