[slurm-users] Nodes stay drained no matter what I do

Thu Aug 24 15:27:19 UTC 2023

Master/Nodes: Ubuntu 20.04, Slurm 19.05.5 (as packaged by Debian)

This is an upgrade from a working Ubuntu 18.04/Slurm 17.x system.  I 
re-used the original slurm.conf, though I worried that might cause issues.  The 
hardware is the same.  The Master and nodes all use the same slurm.conf, 
gres.conf, and cgroup.conf files which are soft linked into 
/etc/slurm-llnl from an NFS mounted filesystem.

As per the subject, the nodes refuse to revert to idle:

-----------------------------------------------------------
root@hypnotoad:~# sinfo -N -l
Thu Aug 24 10:01:20 2023
NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
dgx-2          1       dgx     drained   80   80:1:1 500000        0      1   (null) gres/gpu count repor
dgx-3          1       dgx     drained   80   80:1:1 500000        0      1   (null) gres/gpu count repor
dgx-4          1       dgx     drained   80   80:1:1 500000        0      1   (null) gres/gpu count
...
titan-3        1   titans*     drained   40   40:1:1 250000        0      1   (null) gres/gpu count report
...
-----------------------------------------------------------

Neither of these commands has any effect:

   scontrol update NodeName=dgx-[2-6] State=RESUME
   scontrol update state=idle nodename=dgx-[2-6]

When I check the slurmctld log I find this helpful information:

-----------------------------------------------------------
...
[2023-08-24T00:00:00.033] error: _slurm_rpc_node_registration node=dgx-4: Invalid argument
[2023-08-24T00:00:00.037] error: _slurm_rpc_node_registration node=dgx-2: Invalid argument
[2023-08-24T00:00:00.216] error: _slurm_rpc_node_registration node=titan-12: Invalid argument
[2023-08-24T00:00:00.216] error: _slurm_rpc_node_registration node=titan-11: Invalid argument
[2023-08-24T00:00:00.266] error: _slurm_rpc_node_registration node=dgx-6: Invalid argument
...
-----------------------------------------------------------

From what I found by Googling, this appears to indicate a resource 
mismatch between the actual hardware and what is specified in 
slurm.conf.  The existing configuration worked for Slurm 17, and when I 
checked it again it looked fine to me:

Relevant parts of slurm.conf:

-----------------------------------------------------------
   SchedulerType=sched/backfill
   SelectType=select/cons_res
   SelectTypeParameters=CR_Core_Memory

   PartitionName=titans Default=YES Nodes=titan-[3-15] State=UP MaxTime=UNLIMITED
   PartitionName=dgx Nodes=dgx-[2-6] State=UP MaxTime=UNLIMITED

   GresTypes=gpu
   NodeName=titan-[3-15] Gres=gpu:titanv:8 RealMemory=250000 CPUs=40
   NodeName=dgx-2 Gres=gpu:tesla-v100:7 RealMemory=500000 CPUs=80
   NodeName=dgx-[3-6] Gres=gpu:tesla-v100:8 RealMemory=500000 CPUs=80
-----------------------------------------------------------
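
Since the drain reason in sinfo mentions a gres/gpu count, it may help to spell out what the NodeName lines above actually promise for each node. The following is a minimal Python sketch, not anything Slurm ships; the helper names `expand_nodes` and `expected_gpus` are made up for illustration, and the range expansion only handles the simple `prefix[a-b]` form used in this config.

```python
import re

# NodeName lines copied from the slurm.conf excerpt above.
NODE_LINES = [
    "NodeName=titan-[3-15] Gres=gpu:titanv:8 RealMemory=250000 CPUs=40",
    "NodeName=dgx-2 Gres=gpu:tesla-v100:7 RealMemory=500000 CPUs=80",
    "NodeName=dgx-[3-6] Gres=gpu:tesla-v100:8 RealMemory=500000 CPUs=80",
]


def expand_nodes(expr):
    """Expand 'prefix[a-b]' into individual names (hypothetical helper).

    Only this simple form is handled; anything else is returned unchanged.
    """
    m = re.fullmatch(r"(.*)\[(\d+)-(\d+)\]", expr)
    if not m:
        return [expr]
    prefix, lo, hi = m.group(1), int(m.group(2)), int(m.group(3))
    return [f"{prefix}{i}" for i in range(lo, hi + 1)]


def expected_gpus(lines):
    """Map each node name to the GPU count declared in its Gres= field."""
    result = {}
    for line in lines:
        fields = dict(tok.split("=", 1) for tok in line.split())
        # Gres=gpu:<type>:<count>; the count is the last colon-separated part.
        count = int(fields["Gres"].rsplit(":", 1)[1])
        for node in expand_nodes(fields["NodeName"]):
            result[node] = count
    return result


if __name__ == "__main__":
    for node, count in expected_gpus(NODE_LINES).items():
        print(node, count)
```

The resulting per-node counts are what each node's gres.conf would have to match; the gres.conf contents are not shown in this post, so that comparison is left open here.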

All the nodes in the titan partition are identical hardware, as are the 
nodes in the dgx partition save for dgx-2, which lost a GPU and is no 
longer under warranty.  So, using a couple of representative nodes:

root@dgx-4:~# slurmd -C
NodeName=dgx-4 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=515846

root@titan-8:~# slurmd -C
NodeName=titan-8 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=257811
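
To make the comparison explicit, here is a rough Python sketch that checks the configured CPUs and RealMemory against what `slurmd -C` detected. It assumes the rule of thumb that the configured values must not exceed the detected ones; `parse_fields` and `check_node` are illustrative names, not part of Slurm. The `slurmd -C` output shown here carries no GPU information, so this says nothing about the Gres side.

```python
def parse_fields(line):
    """Split a 'Key=Value Key=Value ...' line into a dict of strings."""
    return dict(tok.split("=", 1) for tok in line.split())


def check_node(configured, detected):
    """List fields where slurm.conf asks for more than slurmd -C found.

    Assumption: a node is only acceptable if the configured value is
    less than or equal to the detected one.
    """
    conf, hw = parse_fields(configured), parse_fields(detected)
    problems = []
    for key in ("CPUs", "RealMemory"):
        if int(conf[key]) > int(hw[key]):
            problems.append(f"{key}: configured {conf[key]} > detected {hw[key]}")
    return problems


# Values from this post: the dgx-[3-6] line of slurm.conf and `slurmd -C` on dgx-4.
configured = "NodeName=dgx-[3-6] Gres=gpu:tesla-v100:8 RealMemory=500000 CPUs=80"
detected = (
    "NodeName=dgx-4 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 "
    "ThreadsPerCore=2 RealMemory=515846"
)

print(check_node(configured, detected))  # prints []
```

With the numbers above, the CPU and memory settings pass this check, which leaves the Gres settings as the part still unverified.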

I'm at a loss for how to debug this and am looking for suggestions. Since 
the resources on these machines are strictly dedicated to Slurm jobs, 
would it be best to use the output of `slurmd -C` directly for the right 
hand side of NodeName, reducing the memory a bit for OS overhead? Is 
there any way to get better debugging output? "Invalid argument" doesn't 
tell me much.
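
For concreteness, this is roughly what I mean by deriving the NodeName line from `slurmd -C`: a small Python sketch that takes the detected line, subtracts some memory for the OS, and adds the Gres field back in. The function name `node_line` and the 4096 MB reserve are placeholders I made up for illustration, not recommended values.

```python
def node_line(slurmd_c_output, gres, reserve_mb):
    """Build a slurm.conf NodeName line from `slurmd -C` output.

    reserve_mb is a hypothetical amount of memory (in MB) held back for
    the OS; gres is supplied by hand because the slurmd -C output shown
    in this post does not include it.
    """
    fields = dict(tok.split("=", 1) for tok in slurmd_c_output.split())
    fields["RealMemory"] = str(int(fields["RealMemory"]) - reserve_mb)
    # Put NodeName and Gres first, then the detected hardware fields in order.
    parts = [f"NodeName={fields.pop('NodeName')}", f"Gres={gres}"]
    parts += [f"{key}={value}" for key, value in fields.items()]
    return " ".join(parts)


detected = (
    "NodeName=dgx-4 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 "
    "ThreadsPerCore=2 RealMemory=515846"
)

# 4096 MB is an arbitrary example reserve, not a tested value.
print(node_line(detected, "gpu:tesla-v100:8", reserve_mb=4096))
```

The node name would still need to be folded back into a range such as dgx-[3-6] by hand.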

Thanks.