[slurm-users] Nodes will not change state from "DOWN*"/"DOWN"

Pär Lundö par.lundo at foi.se
Fri Jul 5 09:39:43 UTC 2019


Hi,

I am running Slurm 19.05 on Ubuntu 18.04 (controller and server) and 
18.10 (nodes).

My problem is that I cannot get the nodes to change their state from 
"DOWN*" to UP or IDLE (the "*" indicates that communication with the 
node has been lost).

I can ping both the node's hostname and its IP address. I have added the 
IP address of the node (only one node is running) to the "NodeAddr" 
field in the "slurm.conf" file, as generated by the configurator tool:
"NodeName=lxclient10 NodeAddr=192.168.1.10 "...

When I run "scontrol show node", the reported "REASON" is "Node 
unexpectedly rebooted".

However, after running "scontrol update NodeName=lxclient10 State=RESUME" 
the state changes to IDLE. Happy with that, I submit a job. The job is 
queued but stays pending ("PD") with the reason "Nodes required for job 
are DOWN, DRAINED or reserved for jobs in higher priority partitions", 
and the node is shown as "IDLE*+COMPLETING" (according to the "scontrol 
show node" command).
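
For reference, here is a minimal sketch of how the state strings quoted 
above can be read. It is an illustration only: "parse_node_state" is a 
hypothetical helper, not part of Slurm, and it handles only the "*" 
suffix and "+"-separated flags that appear in this message.

```python
def parse_node_state(state: str) -> dict:
    """Split a node state string such as 'IDLE*+COMPLETING' into parts.

    Hypothetical helper for illustration; it only understands the '*'
    marker (node not responding) and '+'-separated flags.
    """
    # The first '+'-separated field is the base state, the rest are flags.
    base, *flags = state.split("+")
    return {
        "base": base.rstrip("*"),        # e.g. DOWN or IDLE
        "responding": "*" not in state,  # '*' means communication is lost
        "flags": flags,                  # e.g. ['COMPLETING']
    }


if __name__ == "__main__":
    # The three state strings mentioned in this message.
    for s in ("DOWN*", "DOWN", "IDLE*+COMPLETING"):
        print(s, "->", parse_node_state(s))
```

Read this way, "IDLE*+COMPLETING" is an IDLE node that is still marked 
as not responding, with a job in the COMPLETING phase.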

After a while, when I run "squeue" to check what is happening, the job's 
state is "CG" ("Completing").

Running "scontrol show node" at the same time, I can see that the 
CPULoad is small or 0 and that no CPUs are allocated ("CPUAlloc=0").

My network is a gigabit network and no firewalls are active. The node 
can ping the server and the server can ping the node (by both IP 
address and hostname).
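
Since ping only exercises ICMP, a TCP-level check may be a useful 
complement. The sketch below is an assumption-laden illustration: 
"tcp_reachable" is a hypothetical helper, and the ports in the usage 
note are the Slurm defaults (6817 for slurmctld, 6818 for slurmd), 
which only apply if SlurmctldPort and SlurmdPort are not overridden in 
"slurm.conf".

```python
import socket
import sys


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration; it only tests that something
    accepts connections on the port, not that it is a Slurm daemon.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return False


if __name__ == "__main__":
    if len(sys.argv) == 3:
        host, port = sys.argv[1], int(sys.argv[2])
        status = "reachable" if tcp_reachable(host, port) else "NOT reachable"
        print(f"{host}:{port} is {status}")
    else:
        print("usage: check_port.py HOST PORT")
```

Saved under a name of your choice (here "check_port.py"), it could be 
run from the controller as "python3 check_port.py lxclient10 6818", and 
from the node against the controller's hostname with port 6817.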

Any thoughts on why this is happening?

Best regards,

P



