[slurm-users] Sharding not working correctly if several gpu types are defined
EPF (Esben Peter Friis)
EPF at novozymes.com
Fri Jan 6 12:37:15 UTC 2023
Another update:
Sorry, my bad!
This is already part of the Gres documentation:
"""
For Type to match a system-detected device, it must either exactly match or be a substring of the GPU name reported by slurmd via the AutoDetect mechanism. This GPU name will have all spaces replaced with underscores. To see the GPU name, set SlurmdDebug=debug2 in your slurm.conf and either restart or reconfigure slurmd and check the slurmd log.
"""
The only thing that is still not clear to me is that it also doesn't work if I remove the AutoDetect=nvml line from gres.conf.
Cheers, and have a nice weekend
Esben
________________________________
From: EPF (Esben Peter Friis) <EPF at novozymes.com>
Sent: Thursday, January 5, 2023 17:14
To: slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
Subject: Re: Sharding not working correctly if several gpu types are defined
Update:
If I call the smaller card "Quadro" rather than "RTX5000", it works correctly.
In slurm.conf:
NodeName=koala NodeAddr=10.194.132.190 Boards=1 SocketsPerBoard=2 CoresPerSocket=14 ThreadsPerCore=2 Gres=gpu:A5000:3,gpu:Quadro:1,shard:88 Feature=gpu,ht
In gres.conf:
AutoDetect=nvml
Name=gpu Type=A5000 File=/dev/nvidia0
Name=gpu Type=A5000 File=/dev/nvidia1
Name=gpu Type=Quadro File=/dev/nvidia2
Name=gpu Type=A5000 File=/dev/nvidia3
Name=shard Count=24 File=/dev/nvidia0
Name=shard Count=24 File=/dev/nvidia1
Name=shard Count=16 File=/dev/nvidia2
Name=shard Count=24 File=/dev/nvidia3
Does the name string have to be (part of) what nvidia-smi or NVML reports?
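A quick way to check what the driver reports, assuming nvidia-smi is available on the node:
# print the raw device name of every GPU, one per line
nvidia-smi --query-gpu=name --format=csv,noheader
which here prints "NVIDIA RTX A5000" for GPUs 0, 1 and 3, and "Quadro RTX 5000" for GPU 2.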
Cheers,
Esben
________________________________
From: EPF (Esben Peter Friis) <EPF at novozymes.com>
Sent: Thursday, January 5, 2023 16:51
To: slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
Subject: Sharding not working correctly if several gpu types are defined
Really great that there is now a way to share GPUs between several jobs - even with several GPUs per host. Thanks for adding this feature!
I have compiled (against CUDA 11.8) and installed Slurm 22.05.7.
The test system is one host with 4 GPUs (3 x Nvidia A5000 + 1 x Nvidia RTX5000).
nvidia-smi reports this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13 Driver Version: 525.60.13 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A5000 On | 00000000:02:00.0 Off | Off |
| 42% 62C P2 88W / 230W | 207MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A5000 On | 00000000:03:00.0 Off | Off |
| 45% 61C P5 80W / 230W | 3MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Quadro RTX 5000 On | 00000000:83:00.0 Off | Off |
| 51% 63C P0 67W / 230W | 3MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA RTX A5000 On | 00000000:84:00.0 Off | Off |
| 31% 52C P0 64W / 230W | 3MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
My gres.conf is this. The RTX5000 has less memory, so we configure it with fewer shards:
AutoDetect=nvml
Name=gpu Type=A5000 File=/dev/nvidia0
Name=gpu Type=A5000 File=/dev/nvidia1
Name=gpu Type=RTX5000 File=/dev/nvidia2
Name=gpu Type=A5000 File=/dev/nvidia3
Name=shard Count=24 File=/dev/nvidia0
Name=shard Count=24 File=/dev/nvidia1
Name=shard Count=16 File=/dev/nvidia2
Name=shard Count=24 File=/dev/nvidia3
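As a sanity check on the totals: the per-device shard counts add up to 24 + 24 + 16 + 24 = 88, which matches the shard:88 declared for the node in slurm.conf below.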
If I don't configure the GPUs by type, like this in slurm.conf:
NodeName=koala NodeAddr=10.194.132.190 Boards=1 SocketsPerBoard=2 CoresPerSocket=14 ThreadsPerCore=2 Gres=gpu:4,shard:88 Feature=gpu,ht
and run 7 jobs, each requesting 12 shards, it works exactly as expected: 2 jobs on each of the A5000s and one job on the RTX5000. (Subsequent jobs requesting 12 shards are correctly queued.)
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 904552 C ...ing_proj/.venv/bin/python 204MiB |
| 0 N/A N/A 1160663 C ...-2020-ubuntu20.04/bin/gmx 260MiB |
| 0 N/A N/A 1160758 C ...-2020-ubuntu20.04/bin/gmx 254MiB |
| 1 N/A N/A 1160643 C ...-2020-ubuntu20.04/bin/gmx 262MiB |
| 1 N/A N/A 1160647 C ...-2020-ubuntu20.04/bin/gmx 256MiB |
| 2 N/A N/A 1160659 C ...-2020-ubuntu20.04/bin/gmx 174MiB |
| 3 N/A N/A 1160644 C ...-2020-ubuntu20.04/bin/gmx 248MiB |
| 3 N/A N/A 1160755 C ...-2020-ubuntu20.04/bin/gmx 260MiB |
+-----------------------------------------------------------------------------+
That's great!
If we run jobs requiring one or more full GPUs, we would like to be able to request specific GPU types as well.
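For example, something along these lines (just a sketch, using the type strings from slurm.conf):
sbatch --gres=gpu:A5000:1 --wrap 'bash -c " ... (command goes here) ... "'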
But if I also configure the GPUs by type, like this in slurm.conf:
NodeName=koala NodeAddr=10.194.132.190 Boards=1 SocketsPerBoard=2 CoresPerSocket=14 ThreadsPerCore=2 Gres=gpu:A5000:3,gpu:RTX5000:1,shard:88 Feature=gpu,ht
and run 7 jobs, each requesting 12 shards, it does NOT work. It starts 2 jobs on each of the first two A5000s, two jobs on the RTX5000, and one job on the last A5000. Strangely, it still knows that it should not start more jobs: subsequent jobs are still queued.
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 904552 C ...ing_proj/.venv/bin/python 204MiB |
| 0 N/A N/A 1176564 C ...-2020-ubuntu20.04/bin/gmx 258MiB |
| 0 N/A N/A 1176565 C ...-2020-ubuntu20.04/bin/gmx 258MiB |
| 1 N/A N/A 1176562 C ...-2020-ubuntu20.04/bin/gmx 258MiB |
| 1 N/A N/A 1176566 C ...-2020-ubuntu20.04/bin/gmx 258MiB |
| 2 N/A N/A 1176560 C ...-2020-ubuntu20.04/bin/gmx 172MiB |
| 2 N/A N/A 1176561 C ...-2020-ubuntu20.04/bin/gmx 172MiB |
| 3 N/A N/A 1176563 C ...-2020-ubuntu20.04/bin/gmx 258MiB |
+-----------------------------------------------------------------------------+
It is also strange that "scontrol show node" seems to list the shards correctly, even in this case:
NodeName=koala Arch=x86_64 CoresPerSocket=14
CPUAlloc=0 CPUEfctv=56 CPUTot=56 CPULoad=22.16
AvailableFeatures=gpu,ht
ActiveFeatures=gpu,ht
Gres=gpu:A5000:3(S:0-1),gpu:RTX5000:1(S:0-1),shard:A5000:72(S:0-1),shard:RTX5000:16(S:0-1)
NodeAddr=10.194.132.190 NodeHostName=koala Version=22.05.7
OS=Linux 5.4.0-135-generic #152-Ubuntu SMP Wed Nov 23 20:19:22 UTC 2022
RealMemory=1 AllocMem=0 FreeMem=390036 Sockets=2 Boards=1
State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=urgent,high,medium,low
BootTime=2023-01-03T12:37:17 SlurmdStartTime=2023-01-05T16:24:53
LastBusyTime=2023-01-05T16:37:24
CfgTRES=cpu=56,mem=1M,billing=56,gres/gpu=4,gres/shard=88
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
In all cases, my jobs are submitted with commands like this:
sbatch --gres=shard:12 --wrap 'bash -c " ... (command goes here) ... "'
The behavior is very consistent. I have played around with adding CUDA_DEVICE_ORDER=PCI_BUS_ID to the environment of slurmd and slurmctld, but it makes no difference.
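(For reference, one way to set such a variable for slurmd, assuming it runs as a systemd service, is a drop-in override:
# systemctl edit slurmd
[Service]
Environment=CUDA_DEVICE_ORDER=PCI_BUS_ID
followed by a restart of slurmd.)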
Is this a bug or a feature?
Cheers,
Esben