[slurm-users] How to fix a node in state=inval?

Markuske, William wmarkuske at sdsc.edu
Fri Sep 1 15:14:11 UTC 2023


Are you starting the slurmd via 'slurmd -Z' on the dynamic node?
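
For reference, a dynamic node that is not defined in slurm.conf normally registers itself with something along these lines; the --conf string below is only an illustration, so substitute whatever GRES/features the node actually has:

slurmd -Z --conf "Gres=gpu:geforce:1"    # run on the dynamic node itself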

The next step would be to check the slurmctld log on the master and the slurmd log on the invalid node. Those should provide more insight into why the node is seen as invalid. If you can attach those, we might be able to see the issue.
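
The log locations come from SlurmctldLogFile and SlurmdLogFile in slurm.conf; the paths below are only the common defaults, so check your own config first:

scontrol show config | grep -i logfile
tail -f /var/log/slurm/slurmctld.log     # on the master
tail -f /var/log/slurm/slurmd.log        # on gpu18c04d858b05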

Regards,

--
Willy Markuske

HPC Systems Engineer
MS Data Science and Engineering
SDSC - Research Data Services
(619) 519-4435
wmarkuske at sdsc.edu

On Sep 1, 2023, at 03:12, Jan Andersen <jan at comind.io> wrote:

I am building a cluster exclusively with dynamic nodes, which all boot over the network from the same system image (Debian 12); so far there is just one physical node, as well as a VM that I used for the initial tests:

# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      1  inval gpu18c04d858b05
all*         up   infinite      1  down* node080027aea419

When I compare what the master node thinks of gpu18c04d858b05 with what the node itself reports, they seem to agree:

On gpu18c04d858b05:

root@gpu18c04d858b05:~# slurmd -C
NodeName=gpu18c04d858b05 CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64240
UpTime=0-18:04:06

And on the master:

# scontrol show node gpu18c04d858b05
NodeName=gpu18c04d858b05 Arch=x86_64 CoresPerSocket=8
  CPUAlloc=0 CPUEfctv=16 CPUTot=16 CPULoad=0.16
  AvailableFeatures=(null)
  ActiveFeatures=(null)
  Gres=gpu:geforce:1
  NodeAddr=192.168.50.68 NodeHostName=gpu18c04d858b05 Version=23.02.3
  OS=Linux 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08)
  RealMemory=64240 AllocMem=0 FreeMem=63739 Sockets=1 Boards=1
  State=DOWN+DRAIN+DYNAMIC_NORM+INVALID_REG ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
  Partitions=all
  BootTime=2023-08-31T15:25:55 SlurmdStartTime=2023-08-31T15:26:20
  LastBusyTime=2023-08-31T10:24:01 ResumeAfterTime=None
  CfgTRES=cpu=16,mem=64240M,billing=16
  AllocTRES=
  CapWatts=n/a
  CurrentWatts=0 AveWatts=0
  ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
  Reason=hang [root@2023-08-31T16:38:27]

I tried to fix it with:

# scontrol update nodename=gpu18c04d858b05 state=down reason=hang
# scontrol update nodename=gpu18c04d858b05 state=resume

However, that made no difference; what is the next step in troubleshooting this issue?

