[slurm-users] Nodes stay drained no matter what I do
Patrick Goetz
pgoetz at math.utexas.edu
Fri Aug 25 14:25:51 UTC 2023
Hi Tina -
Thanks for the confirmation! I will make this adjustment to gres.conf.
On 8/25/23 04:50, Tina Friedrich wrote:
> Hi Patrick,
>
> we certainly use that information to set affinity, yes. Our gres.conf
> files (node-specific; our config management creates them locally from
> 'nvidia-smi topo -m') look like this:
>
> Name=gpu Type=a100 File=/dev/nvidia0 CPUs=0-23
> Name=gpu Type=a100 File=/dev/nvidia1 CPUs=0-23
> Name=gpu Type=a100 File=/dev/nvidia2 CPUs=24-47
> Name=gpu Type=a100 File=/dev/nvidia3 CPUs=24-47
>
> which means that the processor affinity is known, and you can request
> GPUs as '--gres=gpu:a100:X'.
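>
> As a minimal example (assuming the config above), a job submitted with
>
> srun --gres=gpu:a100:2 --gres-flags=enforce-binding ...
>
> would only be given cores from the socket(s) local to its two GPUs;
> without enforce-binding the locality is a scheduling preference rather
> than a hard constraint.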
>
> Tina
>
> On 24/08/2023 23:17, Patrick Goetz wrote:
>> Hi Mick -
>>
>> Thanks for these suggestions. I read over both release notes, but
>> didn't find anything helpful.
>>
>> Note that I didn't include gres.conf in my original post; it looks
>> like this:
>>
>> NodeName=titan-[3-15] Name=gpu File=/dev/nvidia[0-7]
>> NodeName=dgx-2 Name=gpu File=/dev/nvidia[0-6]
>> NodeName=dgx-[3-6] Name=gpu File=/dev/nvidia[0-7]
>>
>> Everything is working now, but a SchedMD comment alerted me to this
>> highly useful command:
>>
>> # nvidia-smi topo -m
>>
>> Now I'm wondering if I should be expressing CPU affinity explicitly in
>> the gres.conf file.
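>>
>> Something like this, perhaps (just a sketch -- the real core ranges
>> would have to come from 'nvidia-smi topo -m' on each node type, and
>> splitting 4 GPUs per socket here is only an assumption):
>>
>> NodeName=titan-[3-15] Name=gpu File=/dev/nvidia[0-3] Cores=0-9
>> NodeName=titan-[3-15] Name=gpu File=/dev/nvidia[4-7] Cores=10-19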
>>
>>
>> On 8/24/23 11:24, Timony, Mick wrote:
>>> Hi Patrick,
>>>
>>> You may want to review the release notes for 19.05 and any
>>> intermediate versions:
>>>
>>> https://github.com/SchedMD/slurm/blob/slurm-19-05-5-1/RELEASE_NOTES
>>>
>>> https://github.com/SchedMD/slurm/blob/slurm-18-08-9-1/RELEASE_NOTES
>>>
>>> I'd also check the slurmd.log on the compute nodes. It's usually
>>> in /var/log/slurm/slurmd.log
>>>
>>> I'm not 100% sure your gres.conf is correct. We use one gres.conf for
>>> all our nodes; it looks something like this:
>>>
>>> NodeName=gpu-[1,2] Name=gpu Type=teslaM40 File=/dev/nvidia[0-3]
>>> NodeName=gpu-[3,6] Name=gpu Type=teslaK80 File=/dev/nvidia[0-7]
>>> NodeName=gpu-[7-9] Name=gpu Type=teslaV100 File=/dev/nvidia[0-3]
>>>
>>> SchedMD's docs example is a little different, as they use a unique
>>> gres.conf per node in their example at:
>>>
>>> https://github.com/SchedMD/slurm/blob/slurm-19-05-5-1/doc/man/man5/gres.conf.5
>>>
>>> Name=gpu Type=gtx560 File=/dev/nvidia0 COREs=0,1
>>>
>>> I don't see Name in your gres.conf?
>>>
>>> Kind regards
>>>
>>> --
>>> Mick Timony
>>> Senior DevOps Engineer
>>> Harvard Medical School
>>> --
>>>
>>> ------------------------------------------------------------------------
>>> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf
>>> of Patrick Goetz <pgoetz at math.utexas.edu>
>>> *Sent:* Thursday, August 24, 2023 11:27 AM
>>> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
>>> *Subject:* [slurm-users] Nodes stay drained no matter what I do
>>>
>>> Master/Nodes: Ubuntu 20.04, Slurm 19.05.5 (as packaged by Debian)
>>>
>>> This is an upgrade from a working Ubuntu 18.04/Slurm 17.x system, re-using
>>> the original slurm.conf (fearing this might cause issues). The hardware is
>>> the same. The Master and nodes all use the same slurm.conf, gres.conf, and
>>> cgroup.conf files, which are soft-linked into /etc/slurm-llnl from an
>>> NFS-mounted filesystem.
>>>
>>> As per the subject, the nodes refuse to revert to idle:
>>>
>>> -----------------------------------------------------------
>>> root at hypnotoad:~# sinfo -N -l
>>> Thu Aug 24 10:01:20 2023
>>> NODELIST NODES PARTITION STATE   CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
>>> dgx-2        1 dgx       drained   80 80:1:1 500000        0      1 (null)  gres/gpu count repor
>>> dgx-3        1 dgx       drained   80 80:1:1 500000        0      1 (null)  gres/gpu count repor
>>> dgx-4        1 dgx       drained   80 80:1:1 500000        0      1 (null)  gres/gpu count
>>> ...
>>> titan-3      1 titans*   drained   40 40:1:1 250000        0      1 (null)  gres/gpu count report
>>> ...
>>> -----------------------------------------------------------
>>>
>>> Neither of these commands has any effect:
>>>
>>> scontrol update NodeName=dgx-[2-6] State=RESUME
>>> scontrol update state=idle nodename=dgx-[2-6]
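>>>
>>> (Side note: `sinfo -R` or `scontrol show node dgx-2` shows the full
>>> Reason string, which the sinfo listing above truncates.)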
>>>
>>>
>>> When I check the slurmctld log I find this helpful information:
>>>
>>> -----------------------------------------------------------
>>> ...
>>> [2023-08-24T00:00:00.033] error: _slurm_rpc_node_registration node=dgx-4: Invalid argument
>>> [2023-08-24T00:00:00.037] error: _slurm_rpc_node_registration node=dgx-2: Invalid argument
>>> [2023-08-24T00:00:00.216] error: _slurm_rpc_node_registration node=titan-12: Invalid argument
>>> [2023-08-24T00:00:00.216] error: _slurm_rpc_node_registration node=titan-11: Invalid argument
>>> [2023-08-24T00:00:00.266] error: _slurm_rpc_node_registration node=dgx-6: Invalid argument
>>> ...
>>> -----------------------------------------------------------
>>>
>>> Googling suggests this indicates a resource mismatch between the actual
>>> hardware and what is specified in slurm.conf. The existing configuration
>>> worked under Slurm 17, and having checked it again, it still looks fine
>>> to me:
>>>
>>> Relevant parts of slurm.conf:
>>>
>>> -----------------------------------------------------------
>>> SchedulerType=sched/backfill
>>> SelectType=select/cons_res
>>> SelectTypeParameters=CR_Core_Memory
>>>
>>> PartitionName=titans Default=YES Nodes=titan-[3-15] State=UP MaxTime=UNLIMITED
>>> PartitionName=dgx Nodes=dgx-[2-6] State=UP MaxTime=UNLIMITED
>>>
>>> GresTypes=gpu
>>> NodeName=titan-[3-15] Gres=gpu:titanv:8 RealMemory=250000 CPUs=40
>>> NodeName=dgx-2 Gres=gpu:tesla-v100:7 RealMemory=500000 CPUs=80
>>> NodeName=dgx-[3-6] Gres=gpu:tesla-v100:8 RealMemory=500000 CPUs=80
>>> -----------------------------------------------------------
>>>
>>> All the nodes in the titan partition are identical hardware, as are the
>>> nodes in the dgx partition save for dgx-2, which lost a GPU and is no
>>> longer under warranty. So, using a couple of representative nodes:
>>>
>>> root at dgx-4:~# slurmd -C
>>> NodeName=dgx-4 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=515846
>>>
>>> root at titan-8:~# slurmd -C
>>> NodeName=titan-8 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=257811
>>>
>>>
>>> I'm at a loss for how to debug this and am looking for suggestions. Since
>>> the resources on these machines are strictly dedicated to Slurm jobs,
>>> would it be best to use the output of `slurmd -C` directly for the
>>> right-hand side of NodeName, reducing the memory a bit for OS overhead?
>>> Is there any way to get better debugging output? "Invalid argument"
>>> doesn't tell me much.
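>>>
>>> For concreteness, what I have in mind is something like this (purely
>>> illustrative; RealMemory rounded down a bit from the slurmd -C figures,
>>> with the Gres counts kept from the current config):
>>>
>>> NodeName=titan-[3-15] CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=253000 Gres=gpu:titanv:8
>>> NodeName=dgx-[3-6] CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=510000 Gres=gpu:tesla-v100:8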
>>>
>>> Thanks.
>>>
>>
>