[slurm-users] After reboot nodes are in state = down

Juergen Salk juergen.salk at uni-ulm.de
Fri Sep 27 09:19:16 UTC 2019


Hi Rafał,

you may try setting `ReturnToService=2` in slurm.conf.
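
A rough sketch of how that setting could be applied. The helper name `set_return_to_service` is hypothetical, and the config path and service name in the comments are assumptions about a Debian-style slurm-wlm install, so please adjust them to your setup. The function only rewrites or appends the `ReturnToService` line in a slurm.conf-style file, and it assumes the key is written exactly as `ReturnToService=` at the start of a line.

```shell
# Hypothetical helper: set ReturnToService in a slurm.conf-style file.
# $1 = path to the config file, $2 = desired value (0, 1 or 2).
set_return_to_service() {
    conf="$1"
    value="$2"
    if grep -q '^ReturnToService=' "$conf"; then
        # Replace the existing setting in place; sed keeps a .bak copy.
        sed -i.bak "s/^ReturnToService=.*/ReturnToService=${value}/" "$conf"
    else
        # No explicit setting yet, so append one.
        printf 'ReturnToService=%s\n' "$value" >> "$conf"
    fi
}

# Possible usage (path and service name are assumptions, not run here):
#   sudo sh -c '. ./helper.sh; set_return_to_service /etc/slurm-llnl/slurm.conf 2'
#   sudo systemctl restart slurmctld
```

With the value 2, a node that is DOWN should become available again as soon as its slurmd registers with a valid configuration. slurm.conf should be identical on all nodes, and the controller has to reread it before the change takes effect.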

Best regards
Jürgen

-- 
Jürgen Salk
Scientific Software & Compute Services (SSCS)
Kommunikations- und Informationszentrum (kiz)
Universität Ulm
Telefon: +49 (0)731 50-22478
Telefax: +49 (0)731 50-22471

* Rafał Kędziorski <rafal.kedziorski at gmail.com> [190927 10:36]:
> Hi Andreas,
> 
> my cluster is not running all the time. I just call sudo shutdown, and after
> booting the nodes are in state down.
> 
> I'm using Slurm on a Raspberry Pi cluster (5x Pi 4). What is the best way to
> shut down the nodes so that after booting they are idle and not down?
> 
> 
> Regards,
> Rafal
> 
> On Fri, 27 Sep 2019 at 08:43, Henkel, Andreas <
> henkel at uni-mainz.de> wrote:
> 
> > Hi Rafal,
> >
> > How do you restart the nodes? If you don’t use scontrol reboot <node>,
> > Slurm doesn’t expect the nodes to reboot, which is why you see that reason
> > in those cases.
> >
> > Best
> > Andreas
> >
> > On 27.09.2019 at 07:53, Rafał Kędziorski <
> > rafal.kedziorski at gmail.com> wrote:
> >
> > Hi,
> >
> > I'm working with slurm-wlm 18.08.5-2 on Raspberry Pi Cluster:
> >
> > - 1 Pi 4 as manager
> > - 4 Pi 4 nodes
> >
> > This works fine. But after every restart of the nodes I get this state:
> >
> > cluster@pi-manager:~ $ sinfo
> > PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
> > devcluster*    up   infinite      4   down pi-4-node-[1-4]
> >
> > Then I can call
> >
> > sudo scontrol update NodeName=<node_name> State=RESUME
> >
> > for every node. Sometimes all nodes then become idle, and sometimes some
> > of them stay down:
> >
> > cluster@pi-manager:~ $ sinfo
> > PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
> > devcluster*    up   infinite      2   idle pi-4-node-[1-2]
> > devcluster*    up   infinite      2   down pi-4-node-[3-4]
> >
> > Status of all nodes:
> >
> > cluster@pi-manager:~ $ scontrol show nodes
> > NodeName=pi-4-node-1 Arch=armv7l CoresPerSocket=1
> >    CPUAlloc=0 CPUTot=4 CPULoad=0.24
> >    AvailableFeatures=(null)
> >    ActiveFeatures=(null)
> >    Gres=(null)
> >    NodeAddr=192.168.178.141 NodeHostName=pi-4-node-1 Version=18.08
> >    OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
> >    RealMemory=1 AllocMem=0 FreeMem=3687 Sockets=4 Boards=1
> >    State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> >    Partitions=devcluster
> >    BootTime=2019-09-19T17:38:58 SlurmdStartTime=2019-09-19T00:26:36
> >    CfgTRES=cpu=4,mem=1M,billing=4
> >    AllocTRES=
> >    CapWatts=n/a
> >    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> >    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> >
> >
> > NodeName=pi-4-node-2 Arch=armv7l CoresPerSocket=1
> >    CPUAlloc=0 CPUTot=4 CPULoad=0.06
> >    AvailableFeatures=(null)
> >    ActiveFeatures=(null)
> >    Gres=(null)
> >    NodeAddr=192.168.178.142 NodeHostName=pi-4-node-2 Version=18.08
> >    OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
> >    RealMemory=1 AllocMem=0 FreeMem=3687 Sockets=4 Boards=1
> >    State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> >    Partitions=devcluster
> >    BootTime=2019-09-19T17:38:57 SlurmdStartTime=2019-09-19T00:26:49
> >    CfgTRES=cpu=4,mem=1M,billing=4
> >    AllocTRES=
> >    CapWatts=n/a
> >    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> >    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> >
> >
> > NodeName=pi-4-node-3 Arch=armv7l CoresPerSocket=1
> >    CPUAlloc=0 CPUTot=4 CPULoad=0.02
> >    AvailableFeatures=(null)
> >    ActiveFeatures=(null)
> >    Gres=(null)
> >    NodeAddr=192.168.178.143 NodeHostName=pi-4-node-3 Version=18.08
> >    OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
> >    RealMemory=1 AllocMem=0 FreeMem=3676 Sockets=4 Boards=1
> >    State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> >    Partitions=devcluster
> >    BootTime=2019-09-19T17:38:55 SlurmdStartTime=2019-09-19T00:26:45
> >    CfgTRES=cpu=4,mem=1M,billing=4
> >    AllocTRES=
> >    CapWatts=n/a
> >    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> >    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> >    Reason=Node unexpectedly rebooted [slurm@2019-09-19T17:39:32]
> >
> > NodeName=pi-4-node-4 Arch=armv7l CoresPerSocket=1
> >    CPUAlloc=0 CPUTot=4 CPULoad=0.02
> >    AvailableFeatures=(null)
> >    ActiveFeatures=(null)
> >    Gres=(null)
> >    NodeAddr=192.168.178.144 NodeHostName=pi-4-node-4 Version=18.08
> >    OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
> >    RealMemory=1 AllocMem=0 FreeMem=3687 Sockets=4 Boards=1
> >    State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> >    Partitions=devcluster
> >    BootTime=2019-09-19T17:38:52 SlurmdStartTime=2019-09-19T00:26:47
> >    CfgTRES=cpu=4,mem=1M,billing=4
> >    AllocTRES=
> >    CapWatts=n/a
> >    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> >    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> >    Reason=Node unexpectedly rebooted [slurm@2019-09-19T17:39:30]
> >
> > NodeName=pi-manager Arch=armv7l CoresPerSocket=1
> >    CPUAlloc=0 CPUTot=4 CPULoad=0.00
> >    AvailableFeatures=(null)
> >    ActiveFeatures=(null)
> >    Gres=(null)
> >    NodeAddr=192.168.178.140 NodeHostName=pi-manager Version=18.08
> >    OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
> >    RealMemory=1 AllocMem=0 FreeMem=3446 Sockets=4 Boards=1
> >    State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> >    BootTime=2019-09-19T17:35:48 SlurmdStartTime=2019-09-19T08:10:51
> >    CfgTRES=cpu=4,mem=1M,billing=4
> >    AllocTRES=
> >    CapWatts=n/a
> >    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> >    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> >
> > For the nodes that are down, the Reason is:
> >
> > Reason=Node unexpectedly rebooted [slurm@2019-09-19T17:39:30]
> >
> > What is the problem? Note that the nodes in my cluster are not running all
> > the time.
> >
> >
> >
> > Regards,
> > Rafal
> >
> >
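
Regarding the manual `scontrol update NodeName=<node_name> State=RESUME` step in the quoted message: if you prefer not to change slurm.conf, the resume step could be scripted. Below is a minimal sketch, not a tested tool. The function name `rebooted_nodes` is hypothetical; it only parses `scontrol show nodes` output from stdin and prints the nodes whose Reason is "Node unexpectedly rebooted", assuming the output layout shown in the quoted message.

```shell
# Hypothetical helper: read `scontrol show nodes` output on stdin and print
# the names of nodes whose Reason is "Node unexpectedly rebooted".
rebooted_nodes() {
    awk '
        # Remember the node name from the first field of each record.
        /^NodeName=/ { split($1, kv, "="); node = kv[2] }
        # Print the remembered name when the matching Reason line appears.
        /Reason=Node unexpectedly rebooted/ { print node }
    '
}

# Possible usage on the manager (not run here, needs a working Slurm setup):
#   scontrol show nodes | rebooted_nodes | while read -r n; do
#       sudo scontrol update NodeName="$n" State=RESUME
#   done
```

scontrol also accepts a hostlist expression, so `NodeName=pi-4-node-[1-4]` in a single update call should work as well if all nodes need to be resumed.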

-- 
GPG A997BA7A | 87FC DA31 5F00 C885 0DC3  E28F BD0D 4B33 A997 BA7A


