[slurm-users] Nodes stay drained no matter what I do
Tina Friedrich
tina.friedrich at it.ox.ac.uk
Fri Aug 25 09:50:42 UTC 2023
Hi Patrick,
we certainly use that information to set affinity, yes. Our gres.conf
files (node-specific; our config management creates them locally from
'nvidia-smi topo -m') look like this:
Name=gpu Type=a100 File=/dev/nvidia0 CPUs=0-23
Name=gpu Type=a100 File=/dev/nvidia1 CPUs=0-23
Name=gpu Type=a100 File=/dev/nvidia2 CPUs=24-47
Name=gpu Type=a100 File=/dev/nvidia3 CPUs=24-47
which means that the processor affinity is known, and you can request
GPUs as '--gres=gpu:a100:X'.
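As an example of how that gets used (a sketch only; the script name and
CPU count here are made up), a job wanting two of those GPUs would be
submitted roughly like

  srun --gres=gpu:a100:2 --cpus-per-task=8 ./my_gpu_job

and Slurm should then try to place the task's CPUs within the ranges
associated with the allocated GPUs.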
Tina
On 24/08/2023 23:17, Patrick Goetz wrote:
> Hi Mick -
>
> Thanks for these suggestions. I read over both release notes, but
> didn't find anything helpful.
>
> Note that I didn't include gres.conf in my original post. That would be
> this:
>
> NodeName=titan-[3-15] Name=gpu File=/dev/nvidia[0-7]
> NodeName=dgx-2 Name=gpu File=/dev/nvidia[0-6]
> NodeName=dgx-[3-6] Name=gpu File=/dev/nvidia[0-7]
>
> Everything is working now, but a SchedMD comment alerted me to this
> highly useful command:
>
> # nvidia-smi topo -m
>
> Now I'm wondering if I should be expressing CPU affinity explicitly in
> the gres.conf file.
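>
> If I did, I imagine the titan entries would turn into something like
> this (untested sketch; the core ranges are made up and would really
> come from 'nvidia-smi topo -m' on each node):
>
> NodeName=titan-[3-15] Name=gpu File=/dev/nvidia[0-3] COREs=0-19
> NodeName=titan-[3-15] Name=gpu File=/dev/nvidia[4-7] COREs=20-39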
>
>
> On 8/24/23 11:24, Timony, Mick wrote:
>> Hi Patrick,
>>
>> You may want to review the release notes for 19.05 and any
>> intermediate versions:
>>
>> https://github.com/SchedMD/slurm/blob/slurm-19-05-5-1/RELEASE_NOTES
>>
>> https://github.com/SchedMD/slurm/blob/slurm-18-08-9-1/RELEASE_NOTES
>>
>> I'd also check the slurmd.log on the compute nodes. It's usually in
>> /var/log/slurm/slurmd.log.
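>>
>> Something along these lines usually surfaces the GRES problems quickly
>> (adjust the path if your packages log somewhere else):
>>
>> grep -iE 'gres|error|fatal' /var/log/slurm/slurmd.log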
>>
>> I'm not 100% sure your gres.conf is correct. We use one gres.conf for
>> all our nodes; it looks something like this:
>>
>> NodeName=gpu-[1,2] Name=gpu Type=teslaM40 File=/dev/nvidia[0-3]
>> NodeName=gpu-[3,6] Name=gpu Type=teslaK80 File=/dev/nvidia[0-7]
>> NodeName=gpu-[7-9] Name=gpu Type=teslaV100 File=/dev/nvidia[0-3]
>>
>> The SchedMD docs' example is a little different, as they use a separate
>> gres.conf per node in their example at:
>>
>> https://github.com/SchedMD/slurm/blob/slurm-19-05-5-1/doc/man/man5/gres.conf.5
>>
>> Name=gpu Type=gtx560 File=/dev/nvidia0 COREs=0,1
>>
>> I don't see Name in your gres.conf?
>>
>> Kind regards
>>
>> --
>> Mick Timony
>> Senior DevOps Engineer
>> Harvard Medical School
>> --
>>
>> ------------------------------------------------------------------------
>> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf
>> of Patrick Goetz <pgoetz at math.utexas.edu>
>> *Sent:* Thursday, August 24, 2023 11:27 AM
>> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
>> *Subject:* [slurm-users] Nodes stay drained no matter what I do
>>
>> Master/Nodes: Ubuntu 20.04, Slurm 19.05.5 (as packaged by Debian)
>>
>> This is an upgrade from a working Ubuntu 18.04/Slurm 17.x system where I
>> re-used the original slurm.conf (while fearing this might cause issues).
>> The hardware is the same. The Master and nodes all use the same
>> slurm.conf, gres.conf, and cgroup.conf files, which are soft-linked into
>> /etc/slurm-llnl from an NFS-mounted filesystem.
>>
>> As per the subject, the nodes refuse to revert to idle:
>>
>> -----------------------------------------------------------
>> root at hypnotoad:~# sinfo -N -l
>> Thu Aug 24 10:01:20 2023
>> NODELIST NODES PARTITION STATE   CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
>> dgx-2    1     dgx       drained 80   80:1:1 500000 0        1      (null)   gres/gpu count repor
>> dgx-3    1     dgx       drained 80   80:1:1 500000 0        1      (null)   gres/gpu count repor
>> dgx-4    1     dgx       drained 80   80:1:1 500000 0        1      (null)   gres/gpu count
>> ...
>> titan-3  1     titans*   drained 40   40:1:1 250000 0        1      (null)   gres/gpu count report
>> ...
>> -----------------------------------------------------------
>>
>> Neither of these commands has any effect:
>>
>> scontrol update NodeName=dgx-[2-6] State=RESUME
>> scontrol update state=idle nodename=dgx-[2-6]
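>>
>> (To see the full, untruncated reason string and the GRES the controller
>> thinks each node has, I've also been looking at the likes of
>>
>> scontrol show node dgx-2 | grep -E 'State|Gres|Reason'
>>
>> which reports the same gres/gpu count reason.)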
>>
>>
>> When I check the slurmctld log I find this helpful information:
>>
>> -----------------------------------------------------------
>> ...
>> [2023-08-24T00:00:00.033] error: _slurm_rpc_node_registration node=dgx-4: Invalid argument
>> [2023-08-24T00:00:00.037] error: _slurm_rpc_node_registration node=dgx-2: Invalid argument
>> [2023-08-24T00:00:00.216] error: _slurm_rpc_node_registration node=titan-12: Invalid argument
>> [2023-08-24T00:00:00.216] error: _slurm_rpc_node_registration node=titan-11: Invalid argument
>> [2023-08-24T00:00:00.266] error: _slurm_rpc_node_registration node=dgx-6: Invalid argument
>> ...
>> -----------------------------------------------------------
>>
>> From some Googling, this appears to indicate a resource mismatch between
>> the actual hardware and what is specified in slurm.conf. Note that the
>> existing configuration worked for Slurm 17, but I checked it anyway, and
>> it looks fine to me:
>>
>> Relevant parts of slurm.conf:
>>
>> -----------------------------------------------------------
>> SchedulerType=sched/backfill
>> SelectType=select/cons_res
>> SelectTypeParameters=CR_Core_Memory
>>
>> PartitionName=titans Default=YES Nodes=titan-[3-15] State=UP MaxTime=UNLIMITED
>> PartitionName=dgx Nodes=dgx-[2-6] State=UP MaxTime=UNLIMITED
>>
>> GresTypes=gpu
>> NodeName=titan-[3-15] Gres=gpu:titanv:8 RealMemory=250000 CPUs=40
>> NodeName=dgx-2 Gres=gpu:tesla-v100:7 RealMemory=500000 CPUs=80
>> NodeName=dgx-[3-6] Gres=gpu:tesla-v100:8 RealMemory=500000 CPUs=80
>> -----------------------------------------------------------
>>
>> All the nodes in the titan partition are identical hardware, as are the
>> nodes in the dgx partition save for dgx-2, which lost a GPU and is no
>> longer under warranty. So, using a couple of representative nodes:
>>
>> root at dgx-4:~# slurmd -C
>> NodeName=dgx-4 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=515846
>>
>> root at titan-8:~# slurmd -C
>> NodeName=titan-8 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=257811
>>
>>
>> I'm at a loss for how to debug this and am looking for suggestions.
>> Since the resources on these machines are strictly dedicated to Slurm
>> jobs, would it be best to use the output of `slurmd -C` directly for the
>> right-hand side of NodeName, reducing the memory a bit for OS overhead?
>> Is there any way to get better debugging output? "Invalid argument"
>> doesn't tell me much.
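>>
>> Something like this is what I have in mind (untested sketch; RealMemory
>> rounded down from what slurmd -C reports, Gres carried over from the
>> current slurm.conf):
>>
>> NodeName=dgx-[3-6] CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=500000 Gres=gpu:tesla-v100:8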
>>
>> Thanks.
>