[slurm-users] Nodes will not change state from "DOWN*"/"DOWN"

Pär Lundö par.lundo at foi.se
Fri Jul 5 11:15:15 UTC 2019


Hi,

I managed to isolate the problem: the slurm UID was not the same on all 
machines in the network.
Now a (simple) job runs without problems.
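
For anyone hitting the same symptom, here is a minimal sketch of how the UID 
check could be scripted. It assumes key-based ssh access to every machine and 
that the account is named "slurm"; the controller name "lxserver" is a 
hypothetical placeholder, and only "lxclient10" comes from my setup.

```python
import subprocess


def parse_passwd_uid(passwd_line):
    """Return the numeric UID from one passwd-format line (name:pw:uid:gid:...)."""
    fields = passwd_line.strip().split(":")
    if len(fields) < 3:
        raise ValueError(f"not a passwd entry: {passwd_line!r}")
    return int(fields[2])


def find_uid_mismatches(uid_by_host, reference_host):
    """Return {host: uid} for every host whose UID differs from the reference host's."""
    expected = uid_by_host[reference_host]
    return {host: uid for host, uid in uid_by_host.items() if uid != expected}


def remote_uid(host, user="slurm"):
    """Look up the user's UID on a host via 'getent passwd' over ssh (assumed setup)."""
    result = subprocess.run(
        ["ssh", host, "getent", "passwd", user],
        capture_output=True, text=True, check=True,
    )
    return parse_passwd_uid(result.stdout)


if __name__ == "__main__":
    # "lxserver" is a hypothetical controller hostname; adjust to your cluster.
    hosts = ["lxserver", "lxclient10"]
    uids = {host: remote_uid(host) for host in hosts}
    mismatches = find_uid_mismatches(uids, reference_host=hosts[0])
    print(mismatches or "slurm UID is consistent on all hosts")
```

Any host reported in the output has a slurm account whose UID differs from 
the one on the first host in the list.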

Regards,

P

On 2019-07-05 11:39, Pär Lundö wrote:
> Hi,
>
> I am running Slurm 19.05 on Ubuntu 18.05 (controller and server) and 
> 18.10 (nodes).
>
> My problem is that I cannot get the nodes to change their state to UP or 
> IDLE from "DOWN*" (the "*" indicating that communication is lost).
>
> I can ping both the node's name (its hostname) and the IP address of 
> the node. I have added the IP address of the node (with only one node 
> running) in the "NodeAddr" field of the "slurm.conf" file as follows:
> "NodeName=lxclient10 NodeAddr=192.168.1.10 "... as stated by the 
> configurator tool.
>
> Running "scontrol show node" the stated "REASON" is "Node unexpectedly 
> rebooted".
>
> However, running "scontrol update NodeName=lxclient10 State=RESUME" 
> changes the state to IDLE. Happy with that, I submit a job. The job is 
> queued, but it is listed as "PD" with the reason "Nodes 
> required for job are DOWN, DRAINED or reserved for jobs in higher 
> priority partitions", and the node is listed as "IDLE*+COMPLETING" (seen 
> via the "scontrol show node" command).
>
> After a while, running "squeue" to check what is happening shows the 
> job's state as "CG" ("Completing").
>
> Simultaneously running "scontrol show node", I can see that the CPULoad 
> is small or 0, and that no CPUs are allocated ("CPUAlloc=0").
>
> My network is a gigabit network and no firewalls are active. The node can 
> ping the server and the server can ping the node (both IP and hostname).
>
> Any thoughts on why this is happening?
>
> Best regards,
>
> P
>
>
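
The state strings quoted above ("DOWN*", "IDLE*+COMPLETING") can be split 
into their parts with a few lines of Python. This is only a sketch based on 
what the message describes: "*" marks lost communication and "+" separates 
additional flags. The function name and the returned dictionary keys are my 
own choices, not anything provided by Slurm.

```python
def parse_node_state(state):
    """Split a node state string such as 'IDLE*+COMPLETING' into its parts.

    Assumes only the conventions described in the message: a '*' means the
    node is not responding, and '+' separates the base state from flags.
    """
    not_responding = "*" in state
    parts = state.replace("*", "").split("+")
    return {
        "base": parts[0],           # e.g. DOWN or IDLE
        "flags": parts[1:],         # e.g. ['COMPLETING']
        "not_responding": not_responding,
    }


if __name__ == "__main__":
    for example in ("DOWN*", "IDLE*+COMPLETING", "IDLE"):
        print(example, "->", parse_node_state(example))
```

A node that still shows "not_responding" after a RESUME is a hint that the 
controller can reach the machine on the network but not the slurmd on it.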
-- 
Regards, Pär
________________________________
Pär Lundö
Researcher
Department of Command and Control Systems

FOI
Totalförsvarets forskningsinstitut
164 90 Stockholm

Visiting address:
Olau Magnus väg 33, Linköping


Tel: +46 13 37 86 01
Mob: +46 734 447 815
Vxl: +46 13 37 80 00
par.lundo at foi.se
www.foi.se



