[slurm-users] Sharding not working correctly if several gpu types are defined
EPF (Esben Peter Friis)
EPF at novozymes.com
Fri Jan 6 12:37:15 UTC 2023
Another update:
Sorry, my bad!
This is already part of the Gres documentation:
"""
For Type to match a system-detected device, it must either exactly match or be a substring of the GPU name reported by slurmd via the AutoDetect mechanism. This GPU name will have all spaces replaced with underscores. To see the GPU name, set SlurmdDebug=debug2 in your slurm.conf and either restart or reconfigure slurmd and check the slurmd log.
"""
The only thing that is still not clear to me is that it also doesn't work if I remove the AutoDetect=nvml line from gres.conf.
Cheers, and have a nice weekend
Esben
________________________________
From: EPF (Esben Peter Friis) <EPF at novozymes.com>
Sent: Thursday, January 5, 2023 17:14
To: slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
Subject: Re: Sharding not working correctly if several gpu types are defined
Update:
If I call the smaller card "Quadro" rather than "RTX5000", it works correctly.
In slurm.conf:
NodeName=koala NodeAddr=10.194.132.190 Boards=1 SocketsPerBoard=2 CoresPerSocket=14 ThreadsPerCore=2 Gres=gpu:A5000:3,gpu:Quadro:1,shard:88 Feature=gpu,ht
In gres.conf:
AutoDetect=nvml
Name=gpu Type=A5000 File=/dev/nvidia0
Name=gpu Type=A5000 File=/dev/nvidia1
Name=gpu Type=Quadro File=/dev/nvidia2
Name=gpu Type=A5000 File=/dev/nvidia3
Name=shard Count=24 File=/dev/nvidia0
Name=shard Count=24 File=/dev/nvidia1
Name=shard Count=16 File=/dev/nvidia2
Name=shard Count=24 File=/dev/nvidia3
Does the name string have to be (part of) what nvidia-smi or NVML reports?
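A quick way to check what the driver reports, assuming nvidia-smi is available on the node:
# print the raw device name of every GPU, one per line
nvidia-smi --query-gpu=name --format=csv,noheader
which here prints "NVIDIA RTX A5000" for GPUs 0, 1 and 3, and "Quadro RTX 5000" for GPU 2.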
Cheers,
Esben
________________________________
From: EPF (Esben Peter Friis) <EPF at novozymes.com>
Sent: Thursday, January 5, 2023 16:51
To: slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
Subject: Sharding not working correctly if several gpu types are defined
Really great that there is now a way to share GPUs between several jobs - even with several GPUs per host. Thanks for adding this feature!
I have compiled (against CUDA 11.8) and installed Slurm 22.05.7.
The test system is one host with 4 GPUs (3 x Nvidia A5000 + 1 x Nvidia RTX5000).
nvidia-smi reports this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13 Driver Version: 525.60.13 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A5000 On | 00000000:02:00.0 Off | Off |
| 42% 62C P2 88W / 230W | 207MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A5000 On | 00000000:03:00.0 Off | Off |
| 45% 61C P5 80W / 230W | 3MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Quadro RTX 5000 On | 00000000:83:00.0 Off | Off |
| 51% 63C P0 67W / 230W | 3MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA RTX A5000 On | 00000000:84:00.0 Off | Off |
| 31% 52C P0 64W / 230W | 3MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
My gres.conf is this. The RTX5000 has less memory, so we configure it with fewer shards:
AutoDetect=nvml
Name=gpu Type=A5000 File=/dev/nvidia0
Name=gpu Type=A5000 File=/dev/nvidia1
Name=gpu Type=RTX5000 File=/dev/nvidia2
Name=gpu Type=A5000 File=/dev/nvidia3
Name=shard Count=24 File=/dev/nvidia0
Name=shard Count=24 File=/dev/nvidia1
Name=shard Count=16 File=/dev/nvidia2
Name=shard Count=24 File=/dev/nvidia3
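As a sanity check on the totals: the per-device shard counts add up to 24 + 24 + 16 + 24 = 88, which matches the shard:88 declared for the node in slurm.conf below.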
If I don't configure the GPUs by type, like this in slurm.conf:
NodeName=koala NodeAddr=10.194.132.190 Boards=1 SocketsPerBoard=2 CoresPerSocket=14 ThreadsPerCore=2 Gres=gpu:4,shard:88 Feature=gpu,ht
and run 7 jobs, each requesting 12 shards, it works exactly as expected: 2 jobs on each of the A5000s and one job on the RTX5000. (Subsequent jobs requesting 12 shards are correctly queued.)
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 904552 C ...ing_proj/.venv/bin/python 204MiB |
| 0 N/A N/A 1160663 C ...-2020-ubuntu20.04/bin/gmx 260MiB |
| 0 N/A N/A 1160758 C ...-2020-ubuntu20.04/bin/gmx 254MiB |
| 1 N/A N/A 1160643 C ...-2020-ubuntu20.04/bin/gmx 262MiB |
| 1 N/A N/A 1160647 C ...-2020-ubuntu20.04/bin/gmx 256MiB |
| 2 N/A N/A 1160659 C ...-2020-ubuntu20.04/bin/gmx 174MiB |
| 3 N/A N/A 1160644 C ...-2020-ubuntu20.04/bin/gmx 248MiB |
| 3 N/A N/A 1160755 C ...-2020-ubuntu20.04/bin/gmx 260MiB |
+-----------------------------------------------------------------------------+
That's great!
If we run jobs requiring one or more full GPUs, we would like to be able to request specific GPU types as well.
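For example, something along these lines (just a sketch, using the type strings from slurm.conf):
sbatch --gres=gpu:A5000:1 --wrap 'bash -c " ... (command goes here) ... "'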
But if I also configure the GPUs by type, like this in slurm.conf:
NodeName=koala NodeAddr=10.194.132.190 Boards=1 SocketsPerBoard=2 CoresPerSocket=14 ThreadsPerCore=2 Gres=gpu:A5000:3,gpu:RTX5000:1,shard:88 Feature=gpu,ht
and run 7 jobs, each requesting 12 shards, it does NOT work. It starts 2 jobs on each of the first two A5000s, two jobs on the RTX5000, and one job on the last A5000. Strangely, it still knows that it should not start more jobs: subsequent jobs are still queued.
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 904552 C ...ing_proj/.venv/bin/python 204MiB |
| 0 N/A N/A 1176564 C ...-2020-ubuntu20.04/bin/gmx 258MiB |
| 0 N/A N/A 1176565 C ...-2020-ubuntu20.04/bin/gmx 258MiB |
| 1 N/A N/A 1176562 C ...-2020-ubuntu20.04/bin/gmx 258MiB |
| 1 N/A N/A 1176566 C ...-2020-ubuntu20.04/bin/gmx 258MiB |
| 2 N/A N/A 1176560 C ...-2020-ubuntu20.04/bin/gmx 172MiB |
| 2 N/A N/A 1176561 C ...-2020-ubuntu20.04/bin/gmx 172MiB |
| 3 N/A N/A 1176563 C ...-2020-ubuntu20.04/bin/gmx 258MiB |
+-----------------------------------------------------------------------------+
It is also strange that "scontrol show node" seems to list the shards correctly, even in this case:
NodeName=koala Arch=x86_64 CoresPerSocket=14
CPUAlloc=0 CPUEfctv=56 CPUTot=56 CPULoad=22.16
AvailableFeatures=gpu,ht
ActiveFeatures=gpu,ht
Gres=gpu:A5000:3(S:0-1),gpu:RTX5000:1(S:0-1),shard:A5000:72(S:0-1),shard:RTX5000:16(S:0-1)
NodeAddr=10.194.132.190 NodeHostName=koala Version=22.05.7
OS=Linux 5.4.0-135-generic #152-Ubuntu SMP Wed Nov 23 20:19:22 UTC 2022
RealMemory=1 AllocMem=0 FreeMem=390036 Sockets=2 Boards=1
State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=urgent,high,medium,low
BootTime=2023-01-03T12:37:17 SlurmdStartTime=2023-01-05T16:24:53
LastBusyTime=2023-01-05T16:37:24
CfgTRES=cpu=56,mem=1M,billing=56,gres/gpu=4,gres/shard=88
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
In all cases, my jobs are submitted with commands like this:
sbatch --gres=shard:12 --wrap 'bash -c " ... (command goes here) ... "'
The behavior is very consistent. I have played around with adding CUDA_DEVICE_ORDER=PCI_BUS_ID to the environment of slurmd and slurmctld, but it makes no difference.
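(For reference, one way to set such a variable for slurmd, assuming it runs as a systemd service, is a drop-in override:
# systemctl edit slurmd
[Service]
Environment=CUDA_DEVICE_ORDER=PCI_BUS_ID
followed by a restart of slurmd.)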
Is this a bug or a feature?
Cheers,
Esben