[slurm-users] How to fix a node in state=inval?
Jan Andersen
jan at comind.io
Fri Sep 1 10:12:05 UTC 2023
I am building a cluster exclusively with dynamic nodes, which all boot
up over the network from the same system image (Debian 12). So far there
is just one physical node, as well as a VM that I used for the
initial tests:
# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all* up infinite 1 inval gpu18c04d858b05
all* up infinite 1 down* node080027aea419
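For reference (my own sketch, not part of the original setup description): with dynamic nodes the controller typically only sets an upper bound on the node count, and each netbooted node registers itself at boot with `slurmd -Z`. The Gres value below is copied from the scontrol output further down; every other name and value is an assumption.

```shell
# Hypothetical controller-side slurm.conf excerpt for dynamic nodes
# (names and values are assumptions, not taken from this post):
#
#   MaxNodeCount=64
#   PartitionName=all Nodes=ALL Default=YES MaxTime=INFINITE State=UP
#
# On each netbooted node, slurmd then registers itself dynamically;
# extra attributes such as GRES go on the command line:
#
#   slurmd -Z --conf "Gres=gpu:geforce:1"
```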
When I compare what the master node thinks of gpu18c04d858b05 with what
the node itself reports, they seem to agree:
On gpu18c04d858b05:
root at gpu18c04d858b05:~# slurmd -C
NodeName=gpu18c04d858b05 CPUs=16 Boards=1 SocketsPerBoard=1
CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64240
UpTime=0-18:04:06
And on the master:
# scontrol show node gpu18c04d858b05
NodeName=gpu18c04d858b05 Arch=x86_64 CoresPerSocket=8
CPUAlloc=0 CPUEfctv=16 CPUTot=16 CPULoad=0.16
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:geforce:1
NodeAddr=192.168.50.68 NodeHostName=gpu18c04d858b05 Version=23.02.3
OS=Linux 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1
(2023-05-08)
RealMemory=64240 AllocMem=0 FreeMem=63739 Sockets=1 Boards=1
State=DOWN+DRAIN+DYNAMIC_NORM+INVALID_REG ThreadsPerCore=2 TmpDisk=0
Weight=1 Owner=N/A MCS_label=N/A
Partitions=all
BootTime=2023-08-31T15:25:55 SlurmdStartTime=2023-08-31T15:26:20
LastBusyTime=2023-08-31T10:24:01 ResumeAfterTime=None
CfgTRES=cpu=16,mem=64240M,billing=16
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=hang [root at 2023-08-31T16:38:27]
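The two reports above can be checked field by field rather than by eye; a minimal sketch, using the values quoted in this post (the name mapping between `slurmd -C` and `scontrol` output, e.g. CPUs vs. CPUTot, is my own):

```shell
#!/bin/sh
# Sketch: diff the node's own hardware report against the controller's
# record, field by field. The two strings are copied from the outputs
# above; live values would come from `slurmd -C` on the node and
# `scontrol show node` on the controller.
node_report="CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64240"
ctl_record="CPUTot=16 Boards=1 Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64240"

mismatches=0
for pair in $node_report; do
    key=${pair%%=*}
    val=${pair#*=}
    case $key in                 # slurmd -C and scontrol name some fields differently
        CPUs) key=CPUTot ;;
        SocketsPerBoard) key=Sockets ;;
    esac
    ctl_val=$(printf '%s\n' $ctl_record | awk -F= -v k="$key" '$1 == k { print $2 }')
    if [ "$val" != "$ctl_val" ]; then
        echo "MISMATCH: $key node=$val controller=$ctl_val"
        mismatches=$((mismatches + 1))
    fi
done
echo "fields compared, $mismatches mismatch(es)"
```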
I tried to fix it with:
# scontrol update nodename=gpu18c04d858b05 state=down reason=hang
# scontrol update nodename=gpu18c04d858b05 state=resume
However, that made no difference; what is the next step in
troubleshooting this issue?
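(Editorial sketch, not part of the original question: `state=resume` alone typically cannot clear INVALID_REG, because the controller re-validates the node on its next registration. The controller log normally records why the registration was rejected, so checking it is a reasonable next step. The log path and message wording below are assumptions, simulated with a stand-in file so the filter is demonstrable.)

```shell
# Grep the controller log for the node's registration failure. The path
# is a common default but an assumption; a simulated file stands in here:
LOG=/tmp/slurmctld.sample.log        # live system: e.g. /var/log/slurmctld.log
printf '%s\n' \
  'error: Node gpu18c04d858b05 rejected registration (hypothetical wording)' \
  > "$LOG"
grep 'gpu18c04d858b05' "$LOG"
```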