Hi,
We are testing a MIG deployment on our new Slurm compute node with 4 x
H100 GPUs. Everything appears to be configured correctly, but we have a
problem accessing the MIG devices. When I submit jobs that request a MIG
device with #SBATCH --gres=gpu:H100_1g.10gb:1, the jobs are dispatched to
the node, but only 4 jobs get executed and all other jobs fail. I was able
to "solve it" by adding code to the job script that selects the MIG device
by UUID. Am I missing something in the Slurm/gres/cgroup configuration that
would automatically set CUDA_VISIBLE_DEVICES to the correct MIG UUID for
each job?
If I submit 10 jobs requesting --gres=gpu:H100_1g.10gb:1, 8 jobs start
running:
#1 - CUDA_VISIBLE_DEVICES=0 - jobs 1-4 run on one of the MIG devices
configured on the GPU device
...
#8 - CUDA_VISIBLE_DEVICES=7 - jobs 5-8 fail with the message that there are
no available CUDA_VISIBLE_DEVICES
...
#10 - CUDA_VISIBLE_DEVICES=? - after waiting for free MIG resources, jobs
9-10 run or fail depending on whether the GPU device is free (MIGs are free)
$ nvidia-smi --query-gpu=uuid --id=0 --format=csv,noheader
GPU-6c7f34eb-7b0b-9b09-4a2a-a009fb1ba6d3
$ nvidia-smi --query-gpu=uuid --id=7 --format=csv,noheader
No devices were found
Slurm assigns CUDA_VISIBLE_DEVICES indices based on the requested
resources:
Request:
gpu:H100_1g.10gb:1 - CUDA_VISIBLE_DEVICES=0
gpu:H100_1g.10gb:2 - CUDA_VISIBLE_DEVICES=0,1
...
gpu:H100_1g.10gb:8 - CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
Any job with an H100_1g.10gb request gets device indices between 0 and 7.
gpu:H100_2g.20gb:1 - CUDA_VISIBLE_DEVICES=8
...
gpu:H100_2g.20gb:3 - CUDA_VISIBLE_DEVICES=8,9,10
Any job with an H100_2g.20gb request gets device indices between 8 and 10.
gpu:H100_3g.40gb:1 - CUDA_VISIBLE_DEVICES=11
gpu:H100_3g.40gb:2 - CUDA_VISIBLE_DEVICES=12
gpu:H100_7g.80gb:1 - CUDA_VISIBLE_DEVICES=13
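The index ranges listed above are consistent with indices being handed out consecutively, in the order the types appear in the node's Gres= string. A minimal sketch of that arithmetic follows; the helper name gres_index_ranges is hypothetical, and the consecutive numbering is inferred from the observations in this post, not taken from Slurm documentation.

```bash
#!/usr/bin/env bash
# Print the index range each GRES type would occupy if indices were handed
# out consecutively in the order the types are listed.
# Assumption: mirrors the behaviour observed in this post only.
gres_index_ranges() {
    local gres=$1 start=0 entry name type count
    local -a entries
    IFS=',' read -r -a entries <<< "$gres"
    for entry in "${entries[@]}"; do
        # Each entry has the form gpu:<type>:<count>
        IFS=':' read -r name type count <<< "$entry"
        echo "$type $start-$((start + count - 1))"
        start=$((start + count))
    done
}

# Gres string copied from the node definition in this post
gres_index_ranges "gpu:H100_1g.10gb:8,gpu:H100_2g.20gb:3,gpu:H100_3g.40gb:2,gpu:H100_7g.80gb:1"
```

For the Gres string of node16 this prints the ranges 0-7, 8-10, 11-12 and 13-13, one type per line.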
I was able to "solve it" by adding this code to my slurm script:
# Get the CUDA index using srun and awk
CUDA_INDEX=$(srun env | grep CUDA_VISIBLE_DEVICES | awk -F '=' '{print $2}')
## GPU0 1g.10gb
## MIG-fbb82c87-989b-5e2c-98e5-a57919c5cb1a
## MIG-d1679b4c-c8e7-5d78-bd63-07334ecad664
## MIG-d3ab0675-d318-5e53-b487-b50695cf2e00
## MIG-c1384014-7937-5942-ba2a-900bd8a4c4b0
## MIG-85dd76da-c994-5830-adf0-467c66ae1b95
## MIG-29a6d43b-882e-5b79-868a-15bb2c770b82
## MIG-558debd2-dc13-5406-9256-73ef4f279737
## GPU1 2g.20gb
## MIG-a78cd536-25b7-53cb-941a-a2db5eb0375f
## MIG-0a804c0c-aa27-5993-97cb-cedb854735ce
## MIG-8af9af6d-8720-5763-81e2-83afc43eb42b
## GPU1 1g.10gb
## MIG-68d47441-f5fb-5a3b-ab5b-a0f3449c5b2a
## GPU2 3g.40gb
## MIG-2ed972aa-6640-5116-9271-3d61c0dc9f3b
## MIG-3bdad788-60a5-5bef-85b9-fbdb36dc71fc
## GPU3 7g.80gb
## MIG-ec043869-9176-577e-bac0-46c8411e4e37
# Define the list of UUIDs
UUIDS=(
"MIG-fbb82c87-989b-5e2c-98e5-a57919c5cb1a"
"MIG-d1679b4c-c8e7-5d78-bd63-07334ecad664"
"MIG-d3ab0675-d318-5e53-b487-b50695cf2e00"
"MIG-c1384014-7937-5942-ba2a-900bd8a4c4b0"
"MIG-85dd76da-c994-5830-adf0-467c66ae1b95"
"MIG-29a6d43b-882e-5b79-868a-15bb2c770b82"
"MIG-558debd2-dc13-5406-9256-73ef4f279737"
"MIG-68d47441-f5fb-5a3b-ab5b-a0f3449c5b2a"
"MIG-a78cd536-25b7-53cb-941a-a2db5eb0375f"
"MIG-0a804c0c-aa27-5993-97cb-cedb854735ce"
"MIG-8af9af6d-8720-5763-81e2-83afc43eb42b"
"MIG-2ed972aa-6640-5116-9271-3d61c0dc9f3b"
"MIG-3bdad788-60a5-5bef-85b9-fbdb36dc71fc"
"MIG-ec043869-9176-577e-bac0-46c8411e4e37"
)
# Assign the UUID based on the CUDA index
SELECTED_UUID=${UUIDS[$CUDA_INDEX]}
# Print the selected UUID
echo "$SELECTED_UUID"
export CUDA_VISIBLE_DEVICES=$SELECTED_UUID
echo "CUDA_VISIBLE_DEVICES set to: $SELECTED_UUID"
echo "Test: $CUDA_VISIBLE_DEVICES"
CUDA_VISIBLE_DEVICES=$SELECTED_UUID python3.8 gpu_script.py
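The lookup above handles a single index. A request for more than one device yields a comma-separated list such as 0,1, which would need each index translated separately. A minimal sketch of that translation, with a hypothetical helper name indices_to_uuids and the UUID table passed in as arguments:

```bash
#!/usr/bin/env bash
# Translate a comma-separated list of indices into the matching UUIDs.
# Usage: indices_to_uuids "0,1" uuid0 uuid1 ...
indices_to_uuids() {
    local index_list=$1
    shift
    local -a uuids=("$@")
    local -a indices
    local -a selected=()
    local i
    IFS=',' read -r -a indices <<< "$index_list"
    for i in "${indices[@]}"; do
        # Reject anything that is not a number or lies outside the table
        if [[ ! $i =~ ^[0-9]+$ ]] || (( i >= ${#uuids[@]} )); then
            echo "no UUID for index '$i'" >&2
            return 1
        fi
        selected+=("${uuids[i]}")
    done
    # Join the selected UUIDs with commas
    (IFS=','; echo "${selected[*]}")
}

# Example with the first two UUIDs from the table above
indices_to_uuids "0,1" \
    "MIG-fbb82c87-989b-5e2c-98e5-a57919c5cb1a" \
    "MIG-d1679b4c-c8e7-5d78-bd63-07334ecad664"
```

In the job script this could stand in for the single-index lookup, for example export CUDA_VISIBLE_DEVICES=$(indices_to_uuids "$CUDA_INDEX" "${UUIDS[@]}").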
Result:
Script start
MIG-fbb82c87-989b-5e2c-98e5-a57919c5cb1a
CUDA_VISIBLE_DEVICES set to: MIG-fbb82c87-989b-5e2c-98e5-a57919c5cb1a
Matrix calculation on CUDA device completed successfully.
Script end
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0    7    0      28686      C   ...gpu-cuda12.1-python38/bin/python3.8      866MiB |
|    0    8    0      28744      C   ...gpu-cuda12.1-python38/bin/python3.8      866MiB |
|    0    9    0      28300      C   ...gpu-cuda12.1-python38/bin/python3.8      866MiB |
|    0   10    0      28512      C   ...gpu-cuda12.1-python38/bin/python3.8      866MiB |
|    0   11    0      28506      C   ...gpu-cuda12.1-python38/bin/python3.8      866MiB |
|    0   12    0      28516      C   ...gpu-cuda12.1-python38/bin/python3.8      866MiB |
|    0   13    0      28511      C   ...gpu-cuda12.1-python38/bin/python3.8      866MiB |
|    1    9    0      28552      C   ...gpu-cuda12.1-python38/bin/python3.8      866MiB |
+---------------------------------------------------------------------------------------+
This is the current setup:
RockyLinux 9.2
slurm-22.05.9-1.el9.x86_64
CUDA Version: 12.1 Driver Version: 530.30.02
# nvidia-smi -L
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-6c7f34eb-7b0b-9b09-4a2a-a009fb1ba6d3)
  MIG 1g.10gb Device 0: (UUID: MIG-fbb82c87-989b-5e2c-98e5-a57919c5cb1a)
  MIG 1g.10gb Device 1: (UUID: MIG-d1679b4c-c8e7-5d78-bd63-07334ecad664)
  MIG 1g.10gb Device 2: (UUID: MIG-d3ab0675-d318-5e53-b487-b50695cf2e00)
  MIG 1g.10gb Device 3: (UUID: MIG-c1384014-7937-5942-ba2a-900bd8a4c4b0)
  MIG 1g.10gb Device 4: (UUID: MIG-85dd76da-c994-5830-adf0-467c66ae1b95)
  MIG 1g.10gb Device 5: (UUID: MIG-29a6d43b-882e-5b79-868a-15bb2c770b82)
  MIG 1g.10gb Device 6: (UUID: MIG-558debd2-dc13-5406-9256-73ef4f279737)
GPU 1: NVIDIA H100 80GB HBM3 (UUID: GPU-1439a39c-948b-0657-98e9-aff8595a8729)
  MIG 2g.20gb Device 0: (UUID: MIG-a78cd536-25b7-53cb-941a-a2db5eb0375f)
  MIG 2g.20gb Device 1: (UUID: MIG-0a804c0c-aa27-5993-97cb-cedb854735ce)
  MIG 2g.20gb Device 2: (UUID: MIG-8af9af6d-8720-5763-81e2-83afc43eb42b)
  MIG 1g.10gb Device 3: (UUID: MIG-68d47441-f5fb-5a3b-ab5b-a0f3449c5b2a)
GPU 2: NVIDIA H100 80GB HBM3 (UUID: GPU-3b182fed-bfdf-bc11-d1bd-c41fd1468d2a)
  MIG 3g.40gb Device 0: (UUID: MIG-2ed972aa-6640-5116-9271-3d61c0dc9f3b)
  MIG 3g.40gb Device 1: (UUID: MIG-3bdad788-60a5-5bef-85b9-fbdb36dc71fc)
GPU 3: NVIDIA H100 80GB HBM3 (UUID: GPU-7eba9d78-e908-3db8-2633-a269aeec395e)
  MIG 7g.80gb Device 0: (UUID: MIG-ec043869-9176-577e-bac0-46c8411e4e37)
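The hardcoded UUIDS array in the workaround script lists the MIG devices grouped by profile (all 1g.10gb first, including the one on GPU 1), not in plain nvidia-smi -L order. A minimal sketch of building that list from the listing instead of typing it by hand follows. The helper name mig_uuids_by_profile is hypothetical, and it assumes that nvidia-smi -L prints one device per line on the node (the wrapping above comes from the mail client) and that Slurm's indices follow the profile order given on the command line, which is inferred from the observed mapping only.

```bash
#!/usr/bin/env bash
# Read an `nvidia-smi -L` style listing on stdin and print the MIG UUIDs
# grouped by profile, in the profile order given as arguments.
mig_uuids_by_profile() {
    local listing line profile
    # Captures the profile name (e.g. 1g.10gb) and the MIG UUID of a line
    local re='MIG[[:space:]]+([^[:space:]]+)[[:space:]]+Device.*UUID:[[:space:]]*(MIG-[0-9a-f-]+)'
    listing=$(cat)
    for profile in "$@"; do
        while IFS= read -r line; do
            if [[ $line =~ $re ]] && [[ ${BASH_REMATCH[1]} == "$profile" ]]; then
                echo "${BASH_REMATCH[2]}"
            fi
        done <<< "$listing"
    done
}

# Sample input: the GPU 1 entries from this post
mig_uuids_by_profile 1g.10gb 2g.20gb <<'EOF'
GPU 1: NVIDIA H100 80GB HBM3 (UUID: GPU-1439a39c-948b-0657-98e9-aff8595a8729)
  MIG 2g.20gb Device 0: (UUID: MIG-a78cd536-25b7-53cb-941a-a2db5eb0375f)
  MIG 2g.20gb Device 1: (UUID: MIG-0a804c0c-aa27-5993-97cb-cedb854735ce)
  MIG 2g.20gb Device 2: (UUID: MIG-8af9af6d-8720-5763-81e2-83afc43eb42b)
  MIG 1g.10gb Device 3: (UUID: MIG-68d47441-f5fb-5a3b-ab5b-a0f3449c5b2a)
EOF
```

With the sample input the 1g.10gb UUID is printed first, followed by the three 2g.20gb UUIDs. On the node itself the array could be filled with mapfile -t UUIDS < <(nvidia-smi -L | mig_uuids_by_profile 1g.10gb 2g.20gb 3g.40gb 7g.80gb).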
$ scontrol show node node16
NodeName=node16 Arch=x86_64 CoresPerSocket=1
CPUAlloc=1 CPUEfctv=112 CPUTot=112 CPULoad=0.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:H100_1g.10gb:8,gpu:H100_2g.20gb:3,gpu:H100_3g.40gb:2,gpu:H100_7g.80gb:1
NodeAddr=node16 NodeHostName=node16 Version=22.05.9
OS=Linux 5.14.0-284.30.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Sat Sep 16 09:55:41 UTC 2023
RealMemory=1030000 AllocMem=0 FreeMem=1011801 Sockets=112 Boards=1
State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=gpu_H100
BootTime=2024-01-18T13:52:27 SlurmdStartTime=2024-01-19T11:25:44
LastBusyTime=2024-01-19T10:59:13
CfgTRES=cpu=112,mem=1030000M,billing=112,gres/gpu=14
AllocTRES=cpu=1
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
$ scontrol show partition gpu_H100
PartitionName=gpu_H100
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=node16
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=112 TotalNodes=1 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
TRES=cpu=112,mem=1030000M,node=1,billing=112,gres/gpu=14
slurm.conf:
AccountingStorageTRES=gres/gpu,gres/gpu:A100,gres/gpu:H100,gres/gpu:H100_1g.10gb,gres/gpu:H100_2g.20gb,gres/gpu:H100_3g.40gb,gres/gpu:H100_7g.80gb
GresTypes=gpu
...
NodeName=node16 CPUs=112 RealMemory=1030000 Gres=gpu:H100_1g.10gb:8,gpu:H100_2g.20gb:3,gpu:H100_3g.40gb:2,gpu:H100_7g.80gb:1
PartitionName=gpu_H100 Nodes=node16 Default=NO MaxTime=INFINITE State=UP
gres.conf:
#AutoDetect=nvml
# GPU 0 MIG 0 /proc/driver/nvidia/capabilities/gpu0/mig/gi7/access
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap66,/dev/nvidia-caps/nvidia-cap67 CPUs=0-55
# GPU 0 MIG 1 /proc/driver/nvidia/capabilities/gpu0/mig/gi8/access
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap75,/dev/nvidia-caps/nvidia-cap76 CPUs=0-55
# GPU 0 MIG 2 /proc/driver/nvidia/capabilities/gpu0/mig/gi9/access
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap84,/dev/nvidia-caps/nvidia-cap85 CPUs=0-55
# GPU 0 MIG 3 /proc/driver/nvidia/capabilities/gpu0/mig/gi10/access
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap93,/dev/nvidia-caps/nvidia-cap94 CPUs=0-55
# GPU 0 MIG 4 /proc/driver/nvidia/capabilities/gpu0/mig/gi11/access
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap102,/dev/nvidia-caps/nvidia-cap103 CPUs=0-55
# GPU 0 MIG 5 /proc/driver/nvidia/capabilities/gpu0/mig/gi12/access
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap111,/dev/nvidia-caps/nvidia-cap112 CPUs=0-55
# GPU 0 MIG 6 /proc/driver/nvidia/capabilities/gpu0/mig/gi13/access
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap120,/dev/nvidia-caps/nvidia-cap121 CPUs=0-55
# GPU 1 MIG 0 /proc/driver/nvidia/capabilities/gpu1/mig/gi3/access
NodeName=node16 Name=gpu Type=H100_2g.20gb MultipleFiles=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap165,/dev/nvidia-caps/nvidia-cap166 CPUs=0-55
# GPU 1 MIG 1 /proc/driver/nvidia/capabilities/gpu1/mig/gi5/access
NodeName=node16 Name=gpu Type=H100_2g.20gb MultipleFiles=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap183,/dev/nvidia-caps/nvidia-cap184 CPUs=0-55
# GPU 1 MIG 2 /proc/driver/nvidia/capabilities/gpu1/mig/gi6/access
NodeName=node16 Name=gpu Type=H100_2g.20gb MultipleFiles=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap192,/dev/nvidia-caps/nvidia-cap193 CPUs=0-55
# GPU 1 MIG 3 /proc/driver/nvidia/capabilities/gpu1/mig/gi9/access
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap219,/dev/nvidia-caps/nvidia-cap220 CPUs=0-55
# GPU 2 MIG 0 /proc/driver/nvidia/capabilities/gpu2/mig/gi1/access
NodeName=node16 Name=gpu Type=H100_3g.40gb MultipleFiles=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap282,/dev/nvidia-caps/nvidia-cap283 CPUs=0-55
# GPU 2 MIG 1 /proc/driver/nvidia/capabilities/gpu2/mig/gi2/access
NodeName=node16 Name=gpu Type=H100_3g.40gb MultipleFiles=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap291,/dev/nvidia-caps/nvidia-cap292 CPUs=0-55
# GPU 3 MIG 0 /proc/driver/nvidia/capabilities/gpu3/mig/gi0/access
NodeName=node16 Name=gpu Type=H100_7g.80gb MultipleFiles=/dev/nvidia3,/dev/nvidia-caps/nvidia-cap408,/dev/nvidia-caps/nvidia-cap409 CPUs=0-55
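The nvidia-cap numbers in the 14 entries above all fit one arithmetic pattern: the first cap file is 3 + 135 * GPU index + 9 * GPU instance (gi) number, and the second is the next number. A minimal sketch that regenerates the MultipleFiles value from that pattern can serve as a cross-check for typos in the hand-written entries. The helper name mig_cap_files is hypothetical, and the formula is only fitted to the numbers in this post. It is not a documented guarantee; the access files named in the comments remain the authoritative source.

```bash
#!/usr/bin/env bash
# Rebuild the MultipleFiles value for one MIG device from its GPU index and
# GPU instance (gi) number.
# Assumption: cap minor = 3 + 135*gpu + 9*gi, fitted to the entries above.
mig_cap_files() {
    local gpu=$1 gi=$2
    local access=$((3 + 135 * gpu + 9 * gi))
    echo "/dev/nvidia${gpu},/dev/nvidia-caps/nvidia-cap${access},/dev/nvidia-caps/nvidia-cap$((access + 1))"
}

mig_cap_files 0 7   # GPU 0 MIG 0 (gi7)
mig_cap_files 3 0   # GPU 3 MIG 0 (gi0)
```

The two calls reproduce the cap66/cap67 and cap408/cap409 pairs used in the first and last entries.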
[2024-01-19T14:31:13.701] debug: gres/gpu: init: loaded
[2024-01-19T14:31:13.701] debug: gpu/generic: init: init: GPU Generic plugin loaded
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_1g.10gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_1g.10gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_1g.10gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_1g.10gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_1g.10gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_1g.10gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_1g.10gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_1g.10gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_2g.20gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_2g.20gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_2g.20gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_3g.40gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_3g.40gb Count=1
[2024-01-19T14:31:13.702] Gres Name=gpu Type=H100_7g.80gb Count=1
Best,
Drazen Jalsovec