[slurm-users] Nodes stay drained no matter what I do
Patrick Goetz
pgoetz at math.utexas.edu
Thu Aug 24 22:17:48 UTC 2023
Hi Mick -
Thanks for these suggestions. I read over both release notes, but
didn't find anything helpful.
Note that I didn't include gres.conf in my original post. That would be
this:
NodeName=titan-[3-15] Name=gpu File=/dev/nvidia[0-7]
NodeName=dgx-2 Name=gpu File=/dev/nvidia[0-6]
NodeName=dgx-[3-6] Name=gpu File=/dev/nvidia[0-7]
Everything is working now, but a SchedMD comment alerted me to this
highly useful command:
# nvidia-smi topo -m
Now I'm wondering if I should be expressing CPU affinity explicitly in
the gres.conf file.
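If so, I assume it would look something like this (a hypothetical sketch
for the dgx nodes; the actual core ranges would have to come from the CPU
Affinity column of nvidia-smi topo -m on each node):
# hypothetical split -- core ranges must match nvidia-smi topo -m
NodeName=dgx-[3-6] Name=gpu File=/dev/nvidia[0-3] Cores=0-19
NodeName=dgx-[3-6] Name=gpu File=/dev/nvidia[4-7] Cores=20-39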
On 8/24/23 11:24, Timony, Mick wrote:
> Hi Patrick,
>
> You may want to review the release notes for 19.05 and any intermediate
> versions:
>
> https://github.com/SchedMD/slurm/blob/slurm-19-05-5-1/RELEASE_NOTES
>
> https://github.com/SchedMD/slurm/blob/slurm-18-08-9-1/RELEASE_NOTES
>
> I'd also check the slurmd.log on the compute nodes. It's usually in
> /var/log/slurm/slurmd.log.
>
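> Running slurmd in the foreground with extra verbosity on one of the
> drained nodes can also show why registration is being rejected, e.g.:
>
> slurmd -D -vvv
>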
> I'm not 100% sure your gres.conf is correct. We use one gres.conf for
> all our nodes; it looks something like this:
>
> NodeName=gpu-[1,2] Name=gpu Type=teslaM40 File=/dev/nvidia[0-3]
> NodeName=gpu-[3,6] Name=gpu Type=teslaK80 File=/dev/nvidia[0-7]
> NodeName=gpu-[7-9] Name=gpu Type=teslaV100 File=/dev/nvidia[0-3]
>
> The SchedMD docs example is a little different, as they use a unique
> gres.conf per node in their example at:
>
> https://github.com/SchedMD/slurm/blob/slurm-19-05-5-1/doc/man/man5/gres.conf.5
>
> Name=gpu Type=gtx560 File=/dev/nvidia0 COREs=0,1
>
> I don't see Name in your gres.conf?
>
> Kind regards
>
> --
> Mick Timony
> Senior DevOps Engineer
> Harvard Medical School
> --
>
> ------------------------------------------------------------------------
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
> Patrick Goetz <pgoetz at math.utexas.edu>
> *Sent:* Thursday, August 24, 2023 11:27 AM
> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
> *Subject:* [slurm-users] Nodes stay drained no matter what I do
>
> Master/Nodes: Ubuntu 20.04, Slurm 19.05.5 (as packaged by Debian)
>
> This is an upgrade from a working Ubuntu 18.04/Slurm 17.x system; I
> re-used the original slurm.conf (although I feared this might cause
> issues). The hardware is the same. The master and nodes all use the same
> slurm.conf, gres.conf, and cgroup.conf files, which are soft-linked into
> /etc/slurm-llnl from an NFS-mounted filesystem.
>
> As per the subject, the nodes refuse to revert to idle:
>
> -----------------------------------------------------------
> root@hypnotoad:~# sinfo -N -l
> Thu Aug 24 10:01:20 2023
> NODELIST   NODES  PARTITION  STATE    CPUS  S:C:T   MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
> dgx-2          1  dgx        drained    80  80:1:1  500000         0       1  (null)    gres/gpu count repor
> dgx-3          1  dgx        drained    80  80:1:1  500000         0       1  (null)    gres/gpu count repor
> dgx-4          1  dgx        drained    80  80:1:1  500000         0       1  (null)    gres/gpu count
> ...
> titan-3        1  titans*    drained    40  40:1:1  250000         0       1  (null)    gres/gpu count report
> ...
> -----------------------------------------------------------
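>
> The REASON column is truncated by sinfo here; the full drain reason for
> a node can be read with something like:
>
> scontrol show node dgx-2 | grep -i Reason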
>
> Neither of these commands has any effect:
>
> scontrol update NodeName=dgx-[2-6] State=RESUME
> scontrol update state=idle nodename=dgx-[2-6]
>
>
> When I check the slurmctld log I find this helpful information:
>
> -----------------------------------------------------------
> ...
> [2023-08-24T00:00:00.033] error: _slurm_rpc_node_registration
> node=dgx-4: Invalid argument
> [2023-08-24T00:00:00.037] error: _slurm_rpc_node_registration
> node=dgx-2: Invalid argument
> [2023-08-24T00:00:00.216] error: _slurm_rpc_node_registration
> node=titan-12: Invalid argument
> [2023-08-24T00:00:00.216] error: _slurm_rpc_node_registration
> node=titan-11: Invalid argument
> [2023-08-24T00:00:00.266] error: _slurm_rpc_node_registration
> node=dgx-6: Invalid argument
> ...
> -----------------------------------------------------------
>
> From some googling, this appears to indicate a resource mismatch
> between the actual hardware and what is specified in slurm.conf. Note
> that this configuration worked under Slurm 17, and I checked it again;
> it looks fine to me:
>
> Relevant parts of slurm.conf:
>
> -----------------------------------------------------------
> SchedulerType=sched/backfill
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core_Memory
>
> PartitionName=titans Default=YES Nodes=titan-[3-15] State=UP
> MaxTime=UNLIMITED
> PartitionName=dgx Nodes=dgx-[2-6] State=UP MaxTime=UNLIMITED
>
> GresTypes=gpu
> NodeName=titan-[3-15] Gres=gpu:titanv:8 RealMemory=250000 CPUs=40
> NodeName=dgx-2 Gres=gpu:tesla-v100:7 RealMemory=500000 CPUs=80
> NodeName=dgx-[3-6] Gres=gpu:tesla-v100:8 RealMemory=500000 CPUs=80
> -----------------------------------------------------------
>
> All the nodes in the titan partition are identical hardware, as are the
> nodes in the dgx partition save for dgx-2, which lost a GPU and is no
> longer under warranty. So, using a couple of representative nodes:
>
> root@dgx-4:~# slurmd -C
> NodeName=dgx-4 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20
> ThreadsPerCore=2 RealMemory=515846
>
> root@titan-8:~# slurmd -C
> NodeName=titan-8 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10
> ThreadsPerCore=2 RealMemory=257811
>
>
> I'm at a loss for how to debug this and am looking for suggestions.
> Since the resources on these machines are strictly dedicated to Slurm
> jobs, would it be best to use the output of `slurmd -C` directly for the
> right-hand side of each NodeName line (see the sketch below), reducing
> the memory a bit for OS overhead? Is there any way to get better
> debugging output? "Invalid argument" doesn't tell me much.
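>
> For dgx-[3-6], for instance, that would mean something like this
> (untested sketch, with RealMemory kept a bit below what slurmd -C
> reports):
>
> NodeName=dgx-[3-6] CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=500000 Gres=gpu:tesla-v100:8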
>
> Thanks.