[slurm-users] Sharding not working correctly if several gpu types are defined
EPF (Esben Peter Friis)
EPF at novozymes.com
Thu Jan 5 16:14:16 UTC 2023
Update:
If I call the smaller card "Quadro" rather than "RTX5000", it works correctly
with this slurm.conf:
NodeName=koala NodeAddr=10.194.132.190 Boards=1 SocketsPerBoard=2 CoresPerSocket=14 ThreadsPerCore=2 Gres=gpu:A5000:3,gpu:Quadro:1,shard:88 Feature=gpu,ht
and this gres.conf:
AutoDetect=nvml
Name=gpu Type=A5000 File=/dev/nvidia0
Name=gpu Type=A5000 File=/dev/nvidia1
Name=gpu Type=Quadro File=/dev/nvidia2
Name=gpu Type=A5000 File=/dev/nvidia3
Name=shard Count=24 File=/dev/nvidia0
Name=shard Count=24 File=/dev/nvidia1
Name=shard Count=16 File=/dev/nvidia2
Name=shard Count=24 File=/dev/nvidia3
Does the name string have to be (part of) what nvidia-smi or NVML reports?
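One way to see what names NVML/nvidia-smi actually report per card, to compare against the Type strings in gres.conf, is for example:

nvidia-smi --query-gpu=index,name --format=csv,noheader

(That is just a sanity check on my side; I don't know whether Slurm compares the Type against the full reported string or only a substring of it.)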
Cheers,
Esben
________________________________
From: EPF (Esben Peter Friis) <EPF at novozymes.com>
Sent: Thursday, January 5, 2023 16:51
To: slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
Subject: Sharding not working correctly if several gpu types are defined
Really great that there is now a way to share GPUs between several jobs - even with several GPUs per host. Thanks for adding this feature!
I have compiled Slurm 22.05.7 (against CUDA 11.8) and installed it.
The test system is one host with 4 GPUs (3 x Nvidia A5000 + 1 x Nvidia RTX5000).
nvidia-smi reports this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A5000    On   | 00000000:02:00.0 Off |                  Off |
| 42%   62C    P2    88W / 230W |    207MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000    On   | 00000000:03:00.0 Off |                  Off |
| 45%   61C    P5    80W / 230W |      3MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Quadro RTX 5000     On   | 00000000:83:00.0 Off |                  Off |
| 51%   63C    P0    67W / 230W |      3MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A5000    On   | 00000000:84:00.0 Off |                  Off |
| 31%   52C    P0    64W / 230W |      3MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
My gres.conf is this. The RTX5000 has less memory, so we configure it with fewer shards:
AutoDetect=nvml
Name=gpu Type=A5000 File=/dev/nvidia0
Name=gpu Type=A5000 File=/dev/nvidia1
Name=gpu Type=RTX5000 File=/dev/nvidia2
Name=gpu Type=A5000 File=/dev/nvidia3
Name=shard Count=24 File=/dev/nvidia0
Name=shard Count=24 File=/dev/nvidia1
Name=shard Count=16 File=/dev/nvidia2
Name=shard Count=24 File=/dev/nvidia3
If I don't configure GPUs by type, like this in slurm.conf:
NodeName=koala NodeAddr=10.194.132.190 Boards=1 SocketsPerBoard=2 CoresPerSocket=14 ThreadsPerCore=2 Gres=gpu:4,shard:88 Feature=gpu,ht
and run 7 jobs, each requesting 12 shards, it works exactly as expected: 2 jobs on each of the A5000s and one job on the RTX5000. (Subsequent jobs requesting 12 shards are correctly queued.)
+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    904552      C   ...ing_proj/.venv/bin/python      204MiB |
|    0   N/A  N/A   1160663      C   ...-2020-ubuntu20.04/bin/gmx      260MiB |
|    0   N/A  N/A   1160758      C   ...-2020-ubuntu20.04/bin/gmx      254MiB |
|    1   N/A  N/A   1160643      C   ...-2020-ubuntu20.04/bin/gmx      262MiB |
|    1   N/A  N/A   1160647      C   ...-2020-ubuntu20.04/bin/gmx      256MiB |
|    2   N/A  N/A   1160659      C   ...-2020-ubuntu20.04/bin/gmx      174MiB |
|    3   N/A  N/A   1160644      C   ...-2020-ubuntu20.04/bin/gmx      248MiB |
|    3   N/A  N/A   1160755      C   ...-2020-ubuntu20.04/bin/gmx      260MiB |
+-----------------------------------------------------------------------------+
That's great!
If we run jobs requiring one or more full GPUs, we would also like to be able to request specific GPU types.
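As far as I understand, that would be a request of the --gres=<name>:<type>:<count> form, for example something like:

sbatch --gres=gpu:A5000:1 --wrap 'bash -c " ... (command goes here) ... "'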
But if I also configure the GPUs by type, like this in slurm.conf:
NodeName=koala NodeAddr=10.194.132.190 Boards=1 SocketsPerBoard=2 CoresPerSocket=14 ThreadsPerCore=2 Gres=gpu:A5000:3,gpu:RTX5000:1,shard:88 Feature=gpu,ht
and run 7 jobs, each requesting 12 shards, it does NOT work. It starts two jobs on each of the first two A5000s, two jobs on the RTX5000, and one job on the last A5000. Strangely, it still knows that it should not start more jobs; subsequent jobs are still queued.
+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    904552      C   ...ing_proj/.venv/bin/python      204MiB |
|    0   N/A  N/A   1176564      C   ...-2020-ubuntu20.04/bin/gmx      258MiB |
|    0   N/A  N/A   1176565      C   ...-2020-ubuntu20.04/bin/gmx      258MiB |
|    1   N/A  N/A   1176562      C   ...-2020-ubuntu20.04/bin/gmx      258MiB |
|    1   N/A  N/A   1176566      C   ...-2020-ubuntu20.04/bin/gmx      258MiB |
|    2   N/A  N/A   1176560      C   ...-2020-ubuntu20.04/bin/gmx      172MiB |
|    2   N/A  N/A   1176561      C   ...-2020-ubuntu20.04/bin/gmx      172MiB |
|    3   N/A  N/A   1176563      C   ...-2020-ubuntu20.04/bin/gmx      258MiB |
+-----------------------------------------------------------------------------+
It is also strange that "scontrol show node" seems to list the shards correctly, even in this case:
NodeName=koala Arch=x86_64 CoresPerSocket=14
   CPUAlloc=0 CPUEfctv=56 CPUTot=56 CPULoad=22.16
   AvailableFeatures=gpu,ht
   ActiveFeatures=gpu,ht
   Gres=gpu:A5000:3(S:0-1),gpu:RTX5000:1(S:0-1),shard:A5000:72(S:0-1),shard:RTX5000:16(S:0-1)
   NodeAddr=10.194.132.190 NodeHostName=koala Version=22.05.7
   OS=Linux 5.4.0-135-generic #152-Ubuntu SMP Wed Nov 23 20:19:22 UTC 2022
   RealMemory=1 AllocMem=0 FreeMem=390036 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=urgent,high,medium,low
   BootTime=2023-01-03T12:37:17 SlurmdStartTime=2023-01-05T16:24:53
   LastBusyTime=2023-01-05T16:37:24
   CfgTRES=cpu=56,mem=1M,billing=56,gres/gpu=4,gres/shard=88
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
In all cases, my jobs are submitted with commands like this:
sbatch --gres=shard:12 --wrap 'bash -c " ... (command goes here) ... "'
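A quick way to see which physical GPU a shard job ends up on, assuming the shard plugin sets CUDA_VISIBLE_DEVICES the same way the gpu gres does (I have not verified this), would be something like:

sbatch --gres=shard:12 --wrap 'echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES'

and then checking the resulting slurm-<jobid>.out.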
The behavior is very consistent. I have played around with adding CUDA_DEVICE_ORDER=PCI_BUS_ID to the environment of slurmd and slurmctld, but it makes no difference.
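For reference, the kind of change I mean is a systemd override along these lines, created with "systemctl edit slurmd" (and the same for slurmctld) followed by restarting the daemons, assuming the daemons run under systemd:

[Service]
Environment="CUDA_DEVICE_ORDER=PCI_BUS_ID"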
Is this a bug or a feature?
Cheers,
Esben