[slurm-users] Nodes will not change state from "DOWN*"/"DOWN"

Pär Lundö par.lundo at foi.se
Fri Jul 5 11:15:15 UTC 2019


I managed to isolate the problem: the slurm user's UID was not the same 
on all machines in the network.
A (simple) job now runs without problems.
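
For anyone hitting the same symptom, a UID mismatch of this kind can be spotted by comparing the numeric UID of the slurm account on every machine. The following is only a minimal sketch, not the exact procedure I used: the function name find_uid_mismatches is hypothetical, and it assumes one "hostname uid" pair per input line, with the first line treated as the reference.

```shell
#!/bin/sh
# Hypothetical helper: reads "hostname uid" pairs on stdin and prints every
# host whose UID differs from the first pair (treated as the reference).
# Exit status is 0 if all UIDs match, 1 otherwise.
find_uid_mismatches() {
    awk '
        NR == 1   { ref = $2; next }   # first line supplies the reference UID
        $2 != ref { print $1 ": uid " $2 " (expected " ref ")"; bad = 1 }
        END       { exit bad }
    '
}

# Example of gathering the input (assumes passwordless ssh to each host and
# a local account named "slurm"; the host name "controller" is a placeholder):
#   for h in controller lxclient10; do
#       printf '%s %s\n' "$h" "$(ssh "$h" id -u slurm)"
#   done | find_uid_mismatches
```

Any host printed by the function has a slurm UID that differs from the first host in the list.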



On 2019-07-05 11:39, Pär Lundö wrote:
> Hi,
> I am running Slurm 19.05 on Ubuntu 18.04 (controller and server) and 
> 18.10 (nodes).
> My problem is that I cannot get the nodes to change their state to UP 
> or IDLE from "DOWN*" (the "*" indicating that communication is lost).
> I can ping both the node's name (its hostname) and the IP address of 
> the node. I have added the IP address of the node (with only one node 
> running) in the "NodeAddr" field of the "slurm.conf" file as follows:
> "NodeName=lxclient10 NodeAddr= "... as stated by the 
> configurator tool.
> Running "scontrol show node", the stated "REASON" is "Node unexpectedly 
> rebooted".
> However, after running "scontrol update NodeName=lxclient10 
> State=RESUME" the state changes to IDLE. Happy with that, I submit a 
> job. The job is queued, but it is shown as "PD" (pending) with the 
> reason "Nodes required for job are DOWN, DRAINED or reserved for jobs 
> in higher priority partitions", and the node is shown as 
> "IDLE*+COMPLETING" (according to the "scontrol show node" command).
> After a while, when I run "squeue" to check what is happening, the 
> job's state is "CG" ("Completing").
> Running "scontrol show node" at the same time, I can see that the 
> CPULoad is small or 0 and that no CPUs are allocated ("CPUAlloc=0").
> My network is a gigabit network and no firewalls are active. The node 
> can ping the server and the server can ping the node (both IP and 
> hostname).
> Any thoughts on why this is happening?
> Best regards,
> P
Regards, Pär
Pär Lundö
Avdelningen för Ledningssystem

Totalförsvarets forskningsinstitut
164 90 Stockholm

Olau Magnus väg 33, Linköping

Tel: +46 13 37 86 01
Mob: +46 734 447 815
Vxl: +46 13 37 80 00
par.lundo at foi.se
