[slurm-users] How to fix a node in state=inval?

Jan Andersen jan at comind.io
Fri Sep 1 10:12:05 UTC 2023


I am building a cluster exclusively with dynamic nodes, which all boot
up over the network from the same system image (Debian 12). So far there
is just one physical node, as well as a VM that I have used for the
initial tests:
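For context, a dynamic-node setup like this is normally driven by a MaxNodeCount in slurm.conf on the controller, with each booted image registering itself via slurmd -Z. A minimal sketch of the relevant fragment (the exact values and the Gres line are illustrative assumptions, not my actual config):

```
# slurm.conf on the controller (sketch; values are illustrative)
MaxNodeCount=16
PartitionName=all Nodes=ALL Default=YES MaxTime=INFINITE State=UP

# On each netbooted image, slurmd registers dynamically, e.g.:
#   slurmd -Z --conf "Gres=gpu:geforce:1"
```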

# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      1  inval gpu18c04d858b05
all*         up   infinite      1  down* node080027aea419

When I compare what the master node thinks of gpu18c04d858b05 with what 
the node itself reports, they seem to agree:

On gpu18c04d858b05:

root@gpu18c04d858b05:~# slurmd -C
NodeName=gpu18c04d858b05 CPUs=16 Boards=1 SocketsPerBoard=1 
CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64240
UpTime=0-18:04:06

And on the master:

# scontrol show node gpu18c04d858b05
NodeName=gpu18c04d858b05 Arch=x86_64 CoresPerSocket=8
    CPUAlloc=0 CPUEfctv=16 CPUTot=16 CPULoad=0.16
    AvailableFeatures=(null)
    ActiveFeatures=(null)
    Gres=gpu:geforce:1
    NodeAddr=192.168.50.68 NodeHostName=gpu18c04d858b05 Version=23.02.3
    OS=Linux 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 
(2023-05-08)
    RealMemory=64240 AllocMem=0 FreeMem=63739 Sockets=1 Boards=1
    State=DOWN+DRAIN+DYNAMIC_NORM+INVALID_REG ThreadsPerCore=2 TmpDisk=0 
Weight=1 Owner=N/A MCS_label=N/A
    Partitions=all
    BootTime=2023-08-31T15:25:55 SlurmdStartTime=2023-08-31T15:26:20
    LastBusyTime=2023-08-31T10:24:01 ResumeAfterTime=None
    CfgTRES=cpu=16,mem=64240M,billing=16
    AllocTRES=
    CapWatts=n/a
    CurrentWatts=0 AveWatts=0
    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
    Reason=hang [root@2023-08-31T16:38:27]

I tried to fix it with:

# scontrol update nodename=gpu18c04d858b05 state=down reason=hang
# scontrol update nodename=gpu18c04d858b05 state=resume

However, that made no difference. What is the next step in 
troubleshooting this issue?
