[slurm-users] Nodes stay drained no matter what I do
Tina Friedrich
tina.friedrich at it.ox.ac.uk
Fri Aug 25 09:50:42 UTC 2023
Hi Patrick,
we certainly use that information to set affinity, yes. Our gres.conf
files (node-specific; our config management creates them locally from
'nvidia-smi topo -m') look like this:
Name=gpu Type=a100 File=/dev/nvidia0 CPUs=0-23
Name=gpu Type=a100 File=/dev/nvidia1 CPUs=0-23
Name=gpu Type=a100 File=/dev/nvidia2 CPUs=24-47
Name=gpu Type=a100 File=/dev/nvidia3 CPUs=24-47
which means that the processor affinity is known, and you can request
GPUs as '--gres=gpu:a100:X'.
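As an example of how that gets used (a sketch only; the script name and
CPU count here are made up), a job wanting two of those GPUs would be
submitted roughly like

  srun --gres=gpu:a100:2 --cpus-per-task=8 ./my_gpu_job

and Slurm should then try to place the task's CPUs within the ranges
associated with the allocated GPUs.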
Tina
On 24/08/2023 23:17, Patrick Goetz wrote:
> Hi Mick -
>
> Thanks for these suggestions. I read over both release notes, but
> didn't find anything helpful.
>
> Note that I didn't include gres.conf in my original post. That would be
> this:
>
> NodeName=titan-[3-15] Name=gpu File=/dev/nvidia[0-7]
> NodeName=dgx-2 Name=gpu File=/dev/nvidia[0-6]
> NodeName=dgx-[3-6] Name=gpu File=/dev/nvidia[0-7]
>
> Everything is working now, but a SchedMD comment alerted me to this
> highly useful command:
>
> # nvidia-smi topo -m
>
> Now I'm wondering if I should be expressing CPU affinity explicitly in
> the gres.conf file.
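>
> If I did, I imagine the titan entries would turn into something like
> this (untested sketch; the core ranges are made up and would really
> come from 'nvidia-smi topo -m' on each node):
>
> NodeName=titan-[3-15] Name=gpu File=/dev/nvidia[0-3] COREs=0-19
> NodeName=titan-[3-15] Name=gpu File=/dev/nvidia[4-7] COREs=20-39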
>
>
> On 8/24/23 11:24, Timony, Mick wrote:
>> Hi Patrick,
>>
>> You may want to review the release notes for 19.05 and any
>> intermediate versions:
>>
>> https://github.com/SchedMD/slurm/blob/slurm-19-05-5-1/RELEASE_NOTES
>>
>> https://github.com/SchedMD/slurm/blob/slurm-18-08-9-1/RELEASE_NOTES
>>
>> I'd also check the slurmd.log on the compute nodes. It's usually in
>> /var/log/slurm/slurmd.log.
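>>
>> Something along these lines usually surfaces the GRES problems quickly
>> (adjust the path if your packages log somewhere else):
>>
>> grep -iE 'gres|error|fatal' /var/log/slurm/slurmd.log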
>>
>> I'm not 100% sure your gres.conf is correct. We use one gres.conf for
>> all our nodes; it looks something like this:
>>
>> NodeName=gpu-[1,2] Name=gpu Type=teslaM40 File=/dev/nvidia[0-3]
>> NodeName=gpu-[3,6] Name=gpu Type=teslaK80 File=/dev/nvidia[0-7]
>> NodeName=gpu-[7-9] Name=gpu Type=teslaV100 File=/dev/nvidia[0-3]
>>
>> The SchedMD docs' example is a little different, as they use a separate
>> gres.conf per node in their example at:
>>
>> https://github.com/SchedMD/slurm/blob/slurm-19-05-5-1/doc/man/man5/gres.conf.5
>>
>> Name=gpu Type=gtx560 File=/dev/nvidia0 COREs=0,1
>>
>> I don't see Name in your gres.conf?
>>
>> Kind regards
>>
>> --
>> Mick Timony
>> Senior DevOps Engineer
>> Harvard Medical School
>> --
>>
>> ------------------------------------------------------------------------
>> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf
>> of Patrick Goetz <pgoetz at math.utexas.edu>
>> *Sent:* Thursday, August 24, 2023 11:27 AM
>> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
>> *Subject:* [slurm-users] Nodes stay drained no matter what I do
>>
>> Master/Nodes: Ubuntu 20.04, Slurm 19.05.5 (as packaged by Debian)
>>
>> This is an upgrade from a working Ubuntu 18.04/Slurm 17.x system where I
>> re-used the original slurm.conf (while fearing this might cause issues).
>> The hardware is the same. The Master and nodes all use the same
>> slurm.conf, gres.conf, and cgroup.conf files, which are soft-linked into
>> /etc/slurm-llnl from an NFS-mounted filesystem.
>>
>> As per the subject, the nodes refuse to revert to idle:
>>
>> -----------------------------------------------------------
>> root at hypnotoad:~# sinfo -N -l
>> Thu Aug 24 10:01:20 2023
>> NODELIST NODES PARTITION STATE   CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
>> dgx-2    1     dgx       drained 80   80:1:1 500000 0        1      (null)   gres/gpu count repor
>> dgx-3    1     dgx       drained 80   80:1:1 500000 0        1      (null)   gres/gpu count repor
>> dgx-4    1     dgx       drained 80   80:1:1 500000 0        1      (null)   gres/gpu count
>> ...
>> titan-3  1     titans*   drained 40   40:1:1 250000 0        1      (null)   gres/gpu count report
>> ...
>> -----------------------------------------------------------
>>
>> Neither of these commands has any effect:
>>
>> scontrol update NodeName=dgx-[2-6] State=RESUME
>> scontrol update state=idle nodename=dgx-[2-6]
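>>
>> (To see the full, untruncated reason string and the GRES the controller
>> thinks each node has, I've also been looking at the likes of
>>
>> scontrol show node dgx-2 | grep -E 'State|Gres|Reason'
>>
>> which reports the same gres/gpu count reason.)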
>>
>>
>> When I check the slurmctld log I find this helpful information:
>>
>> -----------------------------------------------------------
>> ...
>> [2023-08-24T00:00:00.033] error: _slurm_rpc_node_registration node=dgx-4: Invalid argument
>> [2023-08-24T00:00:00.037] error: _slurm_rpc_node_registration node=dgx-2: Invalid argument
>> [2023-08-24T00:00:00.216] error: _slurm_rpc_node_registration node=titan-12: Invalid argument
>> [2023-08-24T00:00:00.216] error: _slurm_rpc_node_registration node=titan-11: Invalid argument
>> [2023-08-24T00:00:00.266] error: _slurm_rpc_node_registration node=dgx-6: Invalid argument
>> ...
>> -----------------------------------------------------------
>>
>> From some Googling, this appears to indicate a resource mismatch between
>> the actual hardware and what is specified in slurm.conf. Note that the
>> existing configuration worked for Slurm 17, but I checked it anyway, and
>> it looks fine to me:
>>
>> Relevant parts of slurm.conf:
>>
>> -----------------------------------------------------------
>> SchedulerType=sched/backfill
>> SelectType=select/cons_res
>> SelectTypeParameters=CR_Core_Memory
>>
>> PartitionName=titans Default=YES Nodes=titan-[3-15] State=UP MaxTime=UNLIMITED
>> PartitionName=dgx Nodes=dgx-[2-6] State=UP MaxTime=UNLIMITED
>>
>> GresTypes=gpu
>> NodeName=titan-[3-15] Gres=gpu:titanv:8 RealMemory=250000 CPUs=40
>> NodeName=dgx-2 Gres=gpu:tesla-v100:7 RealMemory=500000 CPUs=80
>> NodeName=dgx-[3-6] Gres=gpu:tesla-v100:8 RealMemory=500000 CPUs=80
>> -----------------------------------------------------------
>>
>> All the nodes in the titan partition are identical hardware, as are the
>> nodes in the dgx partition save for dgx-2, which lost a GPU and is no
>> longer under warranty. So, using a couple of representative nodes:
>>
>> root at dgx-4:~# slurmd -C
>> NodeName=dgx-4 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=515846
>>
>> root at titan-8:~# slurmd -C
>> NodeName=titan-8 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=257811
>>
>>
>> I'm at a loss for how to debug this and am looking for suggestions.
>> Since the resources on these machines are strictly dedicated to Slurm
>> jobs, would it be best to use the output of `slurmd -C` directly for the
>> right-hand side of NodeName, reducing the memory a bit for OS overhead?
>> Is there any way to get better debugging output? "Invalid argument"
>> doesn't tell me much.
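>>
>> Something like this is what I have in mind (untested sketch; RealMemory
>> rounded down from what slurmd -C reports, Gres carried over from the
>> current slurm.conf):
>>
>> NodeName=dgx-[3-6] CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=500000 Gres=gpu:tesla-v100:8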
>>
>> Thanks.
>