[slurm-users] Nodes stay drained no matter what I do

Patrick Goetz pgoetz at math.utexas.edu
Thu Aug 24 22:13:32 UTC 2023


Hi Rob -

Thanks for this suggestion. I'm sure I restarted slurmd on the nodes
multiple times with nothing showing up in the node's slurm log file, but after
running

   # tail -f /var/slurm-llnl/slurmd.log
   # systemctl restart slurmd

I started to get errors in the log that eventually led me to the solution.

To save future users the days of frustration I just experienced, here is 
what I discovered.

All the problems were confined to the shared slurm.conf file.  As a 
reminder, all this just worked in Slurm 17.x.

Slurm 19.05 no longer likes this syntax:

   NodeName=titan-[3-15] Gres=gpu:titanv:8 RealMemory=250000 CPUs=40

The log file on the node included this error message:

   error: Node configuration differs from hardware: Procs=40:40(hw)
   Boards=1:1(hw) SocketsPerBoard=40:2(hw) CoresPerSocket=1:10(hw)
   ThreadsPerCore=1:2(hw)

Notice that the values to the left of each colon come from slurm.conf 
(which only specified CPUs=40, so Slurm filled in default socket/core/thread 
counts), while the (hw) values are what slurmd actually detects on the node. 
The solution was to just use precisely what's reported by

   slurmd -C

on the node:

   NodeName=titan-[3-15] Gres=gpu:titanv:8 CPUs=40 Boards=1
   SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=250000


But that wasn't the only issue.  There was also this:

   WARNING: A line in gres.conf for GRES gpu has 8 more configured than
   expected in slurm.conf. Ignoring extra GRES.

It's calling this a warning, but

   # scontrol show node titan-11 | grep Reason

revealed that this mismatch was causing the node to drain immediately after 
being set to idle.  The problem was this:

    Gres=gpu:titanv:8

               ^
               |

For some reason this syntax was acceptable to Slurm 17, but not Slurm 
19.  The fix was

   Gres=gpu:titanv:8  -->  Gres=gpu:8
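
(If you want to keep the typed form, my best guess, untested and only an 
assumption on my part, is that Slurm 19 expects the type tag in slurm.conf 
to be matched by a Type= field on the corresponding gres.conf lines, 
something like

   NodeName=titan-[3-15] Name=gpu Type=titanv File=/dev/nvidia[0-7]

with the device paths adjusted to whatever the nodes actually expose.)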

Final correct NodeName syntax:

   NodeName=titan-[3-15] Gres=gpu:8 CPUs=40 Boards=1 SocketsPerBoard=2
   CoresPerSocket=10 ThreadsPerCore=2 RealMemory=250000
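
For completeness, fixing the shared slurm.conf isn't enough by itself: 
slurmctld and every slurmd have to re-read it, and the drained nodes have 
to be released. Something along these lines (node ranges taken from the 
partition definitions quoted below):

   # systemctl restart slurmd          (on each node)
   # systemctl restart slurmctld       (on the master)
   # scontrol update NodeName=titan-[3-15],dgx-[2-6] State=RESUME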

Researching all this raised a number of questions, e.g. whether I need to 
express CPU affinity in gres.conf (see the sketch below), but at least the 
users now have the functionality they enjoyed previously.
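
If it turns out that pinning GPUs to cores is worth doing, gres.conf takes 
a Cores= field per device line. A purely hypothetical sketch for the titan 
nodes (the core ranges are made up and would have to match the real NUMA 
layout):

   NodeName=titan-[3-15] Name=gpu File=/dev/nvidia[0-3] Cores=0-9
   NodeName=titan-[3-15] Name=gpu File=/dev/nvidia[4-7] Cores=10-19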

On 8/24/23 11:16, Groner, Rob wrote:
> Ya, I agree about the invalid argument not being much help.
> 
> In times past when I encountered issues like that, I typically tried:
> 
>   * restart slurmd on the compute node.  Watch its log to see what it
>     complains about.  Usually it's about memory.
>   * Set the configuration of the node to whatever slurmd -C says, or set
>     config_overrides in slurm.conf
> 
> Rob
> 
> ------------------------------------------------------------------------
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of 
> Patrick Goetz <pgoetz at math.utexas.edu>
> *Sent:* Thursday, August 24, 2023 11:27 AM
> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
> *Subject:* [slurm-users] Nodes stay drained no matter what I do
> 
> Master/Nodes: Ubuntu 20.04, Slurm 19.05.5 (as packaged by Debian)
> 
> This is an upgrade from a working Ubuntu 18.04/Slurm 17.x system where I
> re-used the original slurm.conf (fearing this might cause issues).  The
> hardware is the same.  The Master and nodes all use the same slurm.conf,
> gres.conf, and cgroup.conf files which are soft linked into
> /etc/slurm-llnl from an NFS mounted filesystem.
> 
> As per the subject, the nodes refuse to revert to idle:
> 
> -----------------------------------------------------------
> root at hypnotoad:~# sinfo -N -l
> Thu Aug 24 10:01:20 2023
> NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK
> WEIGHT AVAIL_FE REASON
> dgx-2          1       dgx     drained   80   80:1:1 500000        0
>    1   (null) gres/gpu count repor
> dgx-3          1       dgx     drained   80   80:1:1 500000        0
>    1   (null) gres/gpu count repor
> dgx-4          1       dgx     drained   80   80:1:1 500000        0
>    1   (null) gres/gpu count
> ...
> titan-3        1   titans*     drained   40   40:1:1 250000        0
>    1   (null) gres/gpu count report
> ...
> -----------------------------------------------------------
> 
> Neither of these commands has any effect:
> 
>     scontrol update NodeName=dgx-[2-6] State=RESUME
>     scontrol update state=idle nodename=dgx-[2-6]
> 
> 
> When I check the slurmctld log I find this helpful information:
> 
> -----------------------------------------------------------
> ...
> [2023-08-24T00:00:00.033] error: _slurm_rpc_node_registration
> node=dgx-4: Invalid argument
> [2023-08-24T00:00:00.037] error: _slurm_rpc_node_registration
> node=dgx-2: Invalid argument
> [2023-08-24T00:00:00.216] error: _slurm_rpc_node_registration
> node=titan-12: Invalid argument
> [2023-08-24T00:00:00.216] error: _slurm_rpc_node_registration
> node=titan-11: Invalid argument
> [2023-08-24T00:00:00.266] error: _slurm_rpc_node_registration
> node=dgx-6: Invalid argument
> ...
> -----------------------------------------------------------
> 
> Googling suggests this indicates a resource mismatch between the actual
> hardware and what is specified in slurm.conf. Note that the existing
> configuration worked for Slurm 17; I checked it again, and it looks fine
> to me:
> 
> Relevant parts of slurm.conf:
> 
> -----------------------------------------------------------
>     SchedulerType=sched/backfill
>     SelectType=select/cons_res
>     SelectTypeParameters=CR_Core_Memory
> 
>     PartitionName=titans Default=YES Nodes=titan-[3-15] State=UP
> MaxTime=UNLIMITED
>     PartitionName=dgx Nodes=dgx-[2-6] State=UP MaxTime=UNLIMITED
> 
>     GresTypes=gpu
>     NodeName=titan-[3-15] Gres=gpu:titanv:8 RealMemory=250000 CPUs=40
>     NodeName=dgx-2 Gres=gpu:tesla-v100:7 RealMemory=500000 CPUs=80
>     NodeName=dgx-[3-6] Gres=gpu:tesla-v100:8 RealMemory=500000 CPUs=80
> -----------------------------------------------------------
> 
> All the nodes in the titan partition are identical hardware, as are the
> nodes in the dgx partition save for dgx-2, which lost a GPU and is no
> longer under warranty.  So, using a couple of representative nodes:
> 
> root at dgx-4:~# slurmd -C
> NodeName=dgx-4 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20
> ThreadsPerCore=2 RealMemory=515846
> 
> root at titan-8:~# slurmd -C
> NodeName=titan-8 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10
> ThreadsPerCore=2 RealMemory=257811
> 
> 
> I'm at a loss for how to debug this and am looking for suggestions. Since
> the resources on these machines are strictly dedicated to Slurm jobs,
> would it be best to use the output of `slurmd -C` directly for the
> right-hand side of NodeName, reducing the memory a bit for OS overhead? Is
> there any way to get better debugging output? "Invalid argument" doesn't
> tell me much.
> 
> Thanks.
> 


