[slurm-users] Nodes stay drained no matter what I do
Patrick Goetz
pgoetz at math.utexas.edu
Thu Aug 24 22:13:32 UTC 2023
Hi Rob -
Thanks for this suggestion. I'm sure I restarted slurmd on the nodes
multiple times with nothing in the slurm log file on the node, but after
# tail -f /var/slurm-llnl/slurmd.log
# systemctl restart slurmd
I started to get errors in the log which eventually led me to the solution.
To save future users the days of frustration I just experienced, here is
what I discovered.
All the problems were confined to the shared slurm.conf file. As a
reminder, all this just worked in Slurm 17.x.
Slurm 19.05 no longer likes this syntax:
NodeName=titan-[3-15] Gres=gpu:titanv:8 RealMemory=250000 CPUs=40
The log file on the node included this error message:
error: Node configuration differs from hardware: Procs=40:40(hw)
Boards=1:1(hw) SocketsPerBoard=40:2(hw) CoresPerSocket=1:10(hw)
ThreadsPerCore=1:2(hw)
Notice the mismatch: the left-hand values are what Slurm derived from the bare
CPUs=40 entry (40 sockets, 1 core per socket, 1 thread), and they don't agree
with the hardware it actually detects (the (hw) numbers).
The solution was to just use precisely what's reported by
slurmd -C
on the node:
NodeName=titan-[3-15] Gres=gpu:titanv:8 CPUs=40 Boards=1
SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=250000
But that wasn't the only issue. There was also this:
WARNING: A line in gres.conf for GRES gpu has 8 more configured than
expected in slurm.conf. Ignoring extra GRES.
It's calling this a warning, but
# scontrol show node titan-11 | grep Reason
revealed that this mismatch was causing the node to drain immediately after
being set to idle. The problem was this:
Gres=gpu:titanv:8
         ^^^^^^
(the ":titanv" type specifier)
For some reason this syntax was acceptable to Slurm 17, but not Slurm
19. The fix was
Gres=gpu:titanv:8 --> Gres=gpu:8
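(Presumably the typed form could also have been kept by declaring a matching
type in gres.conf; the gres.conf man page documents syntax along these lines,
with the device path here being illustrative only:

Name=gpu Type=titanv File=/dev/nvidia[0-7]

I haven't tested that combination on 19.05, so take it as a sketch.)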
Final correct NodeName syntax:
NodeName=titan-[3-15] Gres=gpu:8 CPUs=40 Boards=1 SocketsPerBoard=2
CoresPerSocket=10 ThreadsPerCore=2 RealMemory=250000
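For completeness, the nodes don't un-drain themselves after the edit. The
sequence generally looks like this (node definition changes typically want a
daemon restart rather than just an scontrol reconfigure; adjust the host list
to your own site):

# on each compute node
systemctl restart slurmd

# on the master
systemctl restart slurmctld
scontrol update NodeName=titan-[3-15] State=RESUME

and likewise for the dgx-[2-6] nodes.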
Researching all this raised a number of questions (e.g. do I need to express
CPU affinity in gres.conf?), but at least the users now have the functionality
they enjoyed previously.
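On the affinity question, gres.conf does have a Cores= option for binding each
GPU to particular core indices. A sketch only, since I haven't tried it here
and the split below is a guess at 4 GPUs per socket (check the real topology,
e.g. with nvidia-smi topo -m):

Name=gpu File=/dev/nvidia[0-3] Cores=0-9
Name=gpu File=/dev/nvidia[4-7] Cores=10-19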
On 8/24/23 11:16, Groner, Rob wrote:
> Ya, I agree about the invalid argument not being much help.
>
> In times past when I encountered issues like that, I typically tried:
>
> * restart slurmd on the compute node. Watch its log to see what it
> complains about. Usually it's about memory.
> * Set the configuration of the node to whatever slurmd -C says, or set
> config_overrides in slurm.conf
>
> Rob
>
> ------------------------------------------------------------------------
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
> Patrick Goetz <pgoetz at math.utexas.edu>
> *Sent:* Thursday, August 24, 2023 11:27 AM
> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
> *Subject:* [slurm-users] Nodes stay drained no matter what I do
>
> Master/Nodes: Ubuntu 20.04, Slurm 19.05.5 (as packaged by Debian)
>
> This is an upgrade from a working Ubuntu 18.04/Slurm 17.x system where I
> re-used the original slurm.conf (fearing this might cause issues). The
> hardware is the same. The Master and nodes all use the same slurm.conf,
> gres.conf, and cgroup.conf files which are soft linked into
> /etc/slurm-llnl from an NFS mounted filesystem.
>
> As per the subject, the nodes refuse to revert to idle:
>
> -----------------------------------------------------------
> root at hypnotoad:~# sinfo -N -l
> Thu Aug 24 10:01:20 2023
> NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK
> WEIGHT AVAIL_FE REASON
> dgx-2 1 dgx drained 80 80:1:1 500000 0
> 1 (null) gres/gpu count repor
> dgx-3 1 dgx drained 80 80:1:1 500000 0
> 1 (null) gres/gpu count repor
> dgx-4 1 dgx drained 80 80:1:1 500000 0
> 1 (null) gres/gpu count
> ...
> titan-3 1 titans* drained 40 40:1:1 250000 0
> 1 (null) gres/gpu count report
> ...
> -----------------------------------------------------------
>
> Neither of these commands has any effect:
>
> scontrol update NodeName=dgx-[2-6] State=RESUME
> scontrol update state=idle nodename=dgx-[2-6]
>
>
> When I check the slurmctld log I find this helpful information:
>
> -----------------------------------------------------------
> ...
> [2023-08-24T00:00:00.033] error: _slurm_rpc_node_registration
> node=dgx-4: Invalid argument
> [2023-08-24T00:00:00.037] error: _slurm_rpc_node_registration
> node=dgx-2: Invalid argument
> [2023-08-24T00:00:00.216] error: _slurm_rpc_node_registration
> node=titan-12: Invalid argument
> [2023-08-24T00:00:00.216] error: _slurm_rpc_node_registration
> node=titan-11: Invalid argument
> [2023-08-24T00:00:00.266] error: _slurm_rpc_node_registration
> node=dgx-6: Invalid argument
> ...
> -----------------------------------------------------------
>
> Googling suggests that this indicates a resource mismatch
> between the actual hardware and what is specified in slurm.conf. Note
> that the existing configuration worked for Slurm 17, but I checked, and
> it looks fine to me:
>
> Relevant parts of slurm.conf:
>
> -----------------------------------------------------------
> SchedulerType=sched/backfill
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core_Memory
>
> PartitionName=titans Default=YES Nodes=titan-[3-15] State=UP
> MaxTime=UNLIMITED
> PartitionName=dgx Nodes=dgx-[2-6] State=UP MaxTime=UNLIMITED
>
> GresTypes=gpu
> NodeName=titan-[3-15] Gres=gpu:titanv:8 RealMemory=250000 CPUs=40
> NodeName=dgx-2 Gres=gpu:tesla-v100:7 RealMemory=500000 CPUs=80
> NodeName=dgx-[3-6] Gres=gpu:tesla-v100:8 RealMemory=500000 CPUs=80
> -----------------------------------------------------------
>
> All the nodes in the titan partition are identical hardware, as are the
> nodes in the dgx partition save for dgx-2, which lost a GPU and is no
> longer under warranty. So, using a couple of representative nodes:
>
> root at dgx-4:~# slurmd -C
> NodeName=dgx-4 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20
> ThreadsPerCore=2 RealMemory=515846
>
> root at titan-8:~# slurmd -C
> NodeName=titan-8 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10
> ThreadsPerCore=2 RealMemory=257811
>
>
> I'm at a loss for how to debug this and am looking for suggestions. Since
> the resources on these machines are strictly dedicated to Slurm jobs,
> would it be best to use the output of `slurmd -C` directly for the right
> hand side of NodeName, reducing the memory a bit for OS overhead? Is
> there any way to get better debugging output? "Invalid argument" doesn't
> tell me much.
>
> Thanks.