[slurm-users] Nodes will not change state from "DOWN*"/"DOWN"
Pär Lundö
par.lundo at foi.se
Fri Jul 5 09:39:43 UTC 2019
Hi,
I am running Slurm 19.05 on Ubuntu 18.04 (controller and server) and
18.10 (nodes).
My problem is that I cannot get the nodes to change their state to UP or
IDLE from "DOWN*" (the "*" indicating that communication is lost).
I can ping both the node's hostname and the IP address of
the node. I have added the IP address of the node (with only one node
running) in the "NodeAddr" field of the "slurm.conf" file as follows:
"NodeName=lxclient10 NodeAddr=192.168.1.10 "... as stated by the
configurator tool.
Running "scontrol show node" the stated "REASON" is "Node unexpectedly
rebooted".
However, after running "scontrol update NodeName=lxclient10 State=RESUME"
the state changes to IDLE. Happy with that, I submit a job. The job is
queued, but it is listed as "PD" (pending) with the reason "Nodes
required for job are DOWN, DRAINED or reserved for jobs in higher
priority partitions", and the node is listed as "IDLE*+COMPLETING" (as
shown by the "scontrol show node" command).
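For reference, a minimal sketch of how the state strings quoted above
break down, assuming the layout is a base state, an optional "*" (node
not responding) and optional "+FLAG" suffixes. The function name
split_node_state is hypothetical and not part of Slurm, and the sed
expression that pulls "State=" out of the "scontrol show node" output is
an assumption about that output's layout.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): split a node state string such as
# "IDLE*+COMPLETING" into its base state, its flags, and whether the "*"
# marker (node not responding) is present.
split_node_state() {
    state=$1
    # A "*" anywhere in the string means the node is not responding.
    case $state in
        *'*'*) responding=no ;;
        *)     responding=yes ;;
    esac
    # Drop the "*" marker, then separate the base state from "+FLAG" suffixes.
    clean=$(printf '%s' "$state" | tr -d '*')
    base=${clean%%+*}
    case $clean in
        *+*) flags=${clean#*+} ;;
        *)   flags=none ;;
    esac
    printf 'base=%s flags=%s responding=%s\n' "$base" "$flags" "$responding"
}

# The two states mentioned in this message.
split_node_state 'DOWN*'
split_node_state 'IDLE*+COMPLETING'

# Optional: decode the live state of lxclient10 when scontrol is installed.
# Assumption: the output contains a "State=<value>" field on one line.
if command -v scontrol >/dev/null 2>&1; then
    live=$(scontrol show node lxclient10 |
        sed -n 's/.*State=\([^ ]*\).*/\1/p' | head -n 1)
    split_node_state "$live"
fi
```

For the two hard-coded states this prints "base=DOWN flags=none
responding=no" and "base=IDLE flags=COMPLETING responding=no", i.e. the
node is still flagged as not responding after the RESUME.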
After a while, when I run "squeue" to check what is happening, the job's
state is "CG" ("Completing").
Running "scontrol show node" at the same time, I can see that the CPULoad
is small or 0 and that no CPUs are allocated ("CPUAlloc=0").
My network is a gigabit network and no firewalls are active. The node can
ping the server and the server can ping the node (by both IP address and
hostname).
Any thoughts on why this is happening?
Best regards,
P