[slurm-users] Nodes stay drained no matter what I do
Patrick Goetz
pgoetz at math.utexas.edu
Thu Aug 24 22:17:48 UTC 2023
Hi Mick -
Thanks for these suggestions. I read over both release notes, but
didn't find anything helpful.
Note that I didn't include gres.conf in my original post. That would be
this:
NodeName=titan-[3-15] Name=gpu File=/dev/nvidia[0-7]
NodeName=dgx-2 Name=gpu File=/dev/nvidia[0-6]
NodeName=dgx-[3-6] Name=gpu File=/dev/nvidia[0-7]
Everything is working now, but a SchedMD comment alerted me to this
highly useful command:
# nvidia-smi topo -m
Now I'm wondering if I should be expressing CPU affinity explicitly in
the gres.conf file.
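If so, I assume it would look something like this (a hypothetical sketch
for the dgx nodes; the actual core ranges would have to come from the CPU
Affinity column of nvidia-smi topo -m on each node):
# hypothetical split -- core ranges must match nvidia-smi topo -m
NodeName=dgx-[3-6] Name=gpu File=/dev/nvidia[0-3] Cores=0-19
NodeName=dgx-[3-6] Name=gpu File=/dev/nvidia[4-7] Cores=20-39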
On 8/24/23 11:24, Timony, Mick wrote:
> Hi Patrick,
>
> You may want to review the release notes for 19.05 and any intermediate
> versions:
>
> https://github.com/SchedMD/slurm/blob/slurm-19-05-5-1/RELEASE_NOTES
>
> https://github.com/SchedMD/slurm/blob/slurm-18-08-9-1/RELEASE_NOTES
>
> I'd also check the slurmd.log on the compute nodes. It's usually in
> /var/log/slurm/slurmd.log.
>
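> Running slurmd in the foreground with extra verbosity on one of the
> drained nodes can also show why registration is being rejected, e.g.:
>
> slurmd -D -vvv
>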
> I'm not 100% sure your gres.conf is correct. We use one gres.conf for
> all our nodes; it looks something like this:
>
> NodeName=gpu-[1,2] Name=gpu Type=teslaM40 File=/dev/nvidia[0-3]
> NodeName=gpu-[3,6] Name=gpu Type=teslaK80 File=/dev/nvidia[0-7]
> NodeName=gpu-[7-9] Name=gpu Type=teslaV100 File=/dev/nvidia[0-3]
>
> The SchedMD docs example is a little different, as they use a unique
> gres.conf per node in their example at:
>
> https://github.com/SchedMD/slurm/blob/slurm-19-05-5-1/doc/man/man5/gres.conf.5
>
> Name=gpu Type=gtx560 File=/dev/nvidia0 COREs=0,1
>
> I don't see Name in your gres.conf?
>
> Kind regards
>
> --
> Mick Timony
> Senior DevOps Engineer
> Harvard Medical School
> --
>
> ------------------------------------------------------------------------
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
> Patrick Goetz <pgoetz at math.utexas.edu>
> *Sent:* Thursday, August 24, 2023 11:27 AM
> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
> *Subject:* [slurm-users] Nodes stay drained no matter what I do
>
> Master/Nodes: Ubuntu 20.04, Slurm 19.05.5 (as packaged by Debian)
>
> This is an upgrade from a working Ubuntu 18.04/Slurm 17.x system; I
> re-used the original slurm.conf (although I feared this might cause
> issues). The hardware is the same. The master and nodes all use the same
> slurm.conf, gres.conf, and cgroup.conf files, which are soft-linked into
> /etc/slurm-llnl from an NFS-mounted filesystem.
>
> As per the subject, the nodes refuse to revert to idle:
>
> -----------------------------------------------------------
> root@hypnotoad:~# sinfo -N -l
> Thu Aug 24 10:01:20 2023
> NODELIST   NODES  PARTITION  STATE    CPUS  S:C:T   MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
> dgx-2          1  dgx        drained    80  80:1:1  500000         0       1  (null)    gres/gpu count repor
> dgx-3          1  dgx        drained    80  80:1:1  500000         0       1  (null)    gres/gpu count repor
> dgx-4          1  dgx        drained    80  80:1:1  500000         0       1  (null)    gres/gpu count
> ...
> titan-3        1  titans*    drained    40  40:1:1  250000         0       1  (null)    gres/gpu count report
> ...
> -----------------------------------------------------------
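>
> The REASON column is truncated by sinfo here; the full drain reason for
> a node can be read with something like:
>
> scontrol show node dgx-2 | grep -i Reason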
>
> Neither of these commands has any effect:
>
> scontrol update NodeName=dgx-[2-6] State=RESUME
> scontrol update state=idle nodename=dgx-[2-6]
>
>
> When I check the slurmctld log I find this helpful information:
>
> -----------------------------------------------------------
> ...
> [2023-08-24T00:00:00.033] error: _slurm_rpc_node_registration
> node=dgx-4: Invalid argument
> [2023-08-24T00:00:00.037] error: _slurm_rpc_node_registration
> node=dgx-2: Invalid argument
> [2023-08-24T00:00:00.216] error: _slurm_rpc_node_registration
> node=titan-12: Invalid argument
> [2023-08-24T00:00:00.216] error: _slurm_rpc_node_registration
> node=titan-11: Invalid argument
> [2023-08-24T00:00:00.266] error: _slurm_rpc_node_registration
> node=dgx-6: Invalid argument
> ...
> -----------------------------------------------------------
>
> From some googling, this appears to indicate a resource mismatch
> between the actual hardware and what is specified in slurm.conf. Note
> that this configuration worked under Slurm 17, and I checked it again;
> it looks fine to me:
>
> Relevant parts of slurm.conf:
>
> -----------------------------------------------------------
> SchedulerType=sched/backfill
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core_Memory
>
> PartitionName=titans Default=YES Nodes=titan-[3-15] State=UP
> MaxTime=UNLIMITED
> PartitionName=dgx Nodes=dgx-[2-6] State=UP MaxTime=UNLIMITED
>
> GresTypes=gpu
> NodeName=titan-[3-15] Gres=gpu:titanv:8 RealMemory=250000 CPUs=40
> NodeName=dgx-2 Gres=gpu:tesla-v100:7 RealMemory=500000 CPUs=80
> NodeName=dgx-[3-6] Gres=gpu:tesla-v100:8 RealMemory=500000 CPUs=80
> -----------------------------------------------------------
>
> All the nodes in the titan partition are identical hardware, as are the
> nodes in the dgx partition save for dgx-2, which lost a GPU and is no
> longer under warranty. So, using a couple of representative nodes:
>
> root@dgx-4:~# slurmd -C
> NodeName=dgx-4 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20
> ThreadsPerCore=2 RealMemory=515846
>
> root@titan-8:~# slurmd -C
> NodeName=titan-8 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10
> ThreadsPerCore=2 RealMemory=257811
>
>
> I'm at a loss for how to debug this and am looking for suggestions.
> Since the resources on these machines are strictly dedicated to Slurm
> jobs, would it be best to use the output of `slurmd -C` directly for the
> right-hand side of each NodeName line (see the sketch below), reducing
> the memory a bit for OS overhead? Is there any way to get better
> debugging output? "Invalid argument" doesn't tell me much.
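>
> For dgx-[3-6], for instance, that would mean something like this
> (untested sketch, with RealMemory kept a bit below what slurmd -C
> reports):
>
> NodeName=dgx-[3-6] CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=500000 Gres=gpu:tesla-v100:8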
>
> Thanks.