Hi,

We are testing a MIG deployment on our new Slurm compute node with 4 x H100 GPUs. Everything looks correctly configured, but we have a problem accessing the MIG devices. When I submit jobs requesting a MIG GPU device (#SBATCH --gres=gpu:H100_1g.10gb:1), the jobs get submitted to the node, but only 4 jobs get executed and all the other jobs fail. I was able to "solve it" by adding code that points the job at the MIG device UUID. Am I missing something in the Slurm/gres/cgroups configuration that would automatically assign the correct MIG UUID to CUDA_VISIBLE_DEVICES for the job?
If I submit 10 jobs requesting --gres=gpu:H100_1g.10gb:1, 8 jobs start running:

#1 - CUDA_VISIBLE_DEVICES=0 - jobs 1-4 run on one of the MIG devices configured on the GPU device
...
#8 - CUDA_VISIBLE_DEVICES=7 - jobs 5-8 fail with the message that there are no available CUDA_VISIBLE_DEVICES
...
#10 - CUDA_VISIBLE_DEVICES=? - after waiting for free MIG resources, jobs 9-10 run or fail depending on whether the GPU device is free (the MIGs are free)
$ nvidia-smi --query-gpu=uuid --id=0 --format=csv,noheader
GPU-6c7f34eb-7b0b-9b09-4a2a-a009fb1ba6d3
$ nvidia-smi --query-gpu=uuid --id=7 --format=csv,noheader
No devices were found
Slurm assigns the CUDA_VISIBLE_DEVICES numbers based on the requested resources:
Request:
gpu:H100_1g.10gb:1 - CUDA_VISIBLE_DEVICES=0
gpu:H100_1g.10gb:2 - CUDA_VISIBLE_DEVICES=0,1
... Any job with an H100_1g.10gb request is going to get a device number between 0 and 7
gpu:H100_1g.10gb:8 - CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
gpu:H100_2g.20gb:1 - CUDA_VISIBLE_DEVICES=8
... Any job with an H100_2g.20gb request is going to get a device number between 8 and 10
gpu:H100_2g.20gb:3 - CUDA_VISIBLE_DEVICES=8,9,10
gpu:H100_3g.40gb:1 - CUDA_VISIBLE_DEVICES=11
gpu:H100_3g.40gb:2 - CUDA_VISIBLE_DEVICES=12
gpu:H100_7g.80gb:1 - CUDA_VISIBLE_DEVICES=13
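The numbering above looks as if it simply follows the order and counts of the types in the node's Gres string. A minimal sketch of that reading, assuming the observation holds in general (the function name gres_index_ranges is made up for illustration and is not part of Slurm):

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the index range each GRES type would receive,
# ASSUMING indices are handed out in the order the types appear in Gres=.
gres_index_ranges() {
    local gres=$1 start=0 entry type count
    local IFS=','
    for entry in $gres; do
        # entry looks like gpu:H100_1g.10gb:8
        type=${entry#*:}      # strip the leading "gpu:"  -> H100_1g.10gb:8
        count=${type##*:}     # keep the trailing count   -> 8
        type=${type%:*}       # keep the type name        -> H100_1g.10gb
        echo "$type $start-$((start + count - 1))"
        start=$((start + count))
    done
}

# Gres string copied from the node definition in this post.
gres_index_ranges "gpu:H100_1g.10gb:8,gpu:H100_2g.20gb:3,gpu:H100_3g.40gb:2,gpu:H100_7g.80gb:1"
```

For the Gres string of node16 this prints the ranges 0-7, 8-10, 11-12 and 13-13, which matches the device numbers listed above.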
I was able to "solve it" by adding this code to my slurm script:
# Get the CUDA index using srun and awk
CUDA_INDEX=$(srun env | grep CUDA_VISIBLE_DEVICES | awk -F '=' '{print $2}')

## GPU0 1g.10gb
## MIG-fbb82c87-989b-5e2c-98e5-a57919c5cb1a
## MIG-d1679b4c-c8e7-5d78-bd63-07334ecad664
## MIG-d3ab0675-d318-5e53-b487-b50695cf2e00
## MIG-c1384014-7937-5942-ba2a-900bd8a4c4b0
## MIG-85dd76da-c994-5830-adf0-467c66ae1b95
## MIG-29a6d43b-882e-5b79-868a-15bb2c770b82
## MIG-558debd2-dc13-5406-9256-73ef4f279737
## GPU1 2g.20gb
## MIG-a78cd536-25b7-53cb-941a-a2db5eb0375f
## MIG-0a804c0c-aa27-5993-97cb-cedb854735ce
## MIG-8af9af6d-8720-5763-81e2-83afc43eb42b
## GPU1 1g.10gb
## MIG-68d47441-f5fb-5a3b-ab5b-a0f3449c5b2a
## GPU2 3g.40gb
## MIG-2ed972aa-6640-5116-9271-3d61c0dc9f3b
## MIG-3bdad788-60a5-5bef-85b9-fbdb36dc71fc
## GPU3 7g.80gb
## MIG-ec043869-9176-577e-bac0-46c8411e4e37

# Define the list of UUIDs
UUIDS=(
    "MIG-fbb82c87-989b-5e2c-98e5-a57919c5cb1a"
    "MIG-d1679b4c-c8e7-5d78-bd63-07334ecad664"
    "MIG-d3ab0675-d318-5e53-b487-b50695cf2e00"
    "MIG-c1384014-7937-5942-ba2a-900bd8a4c4b0"
    "MIG-85dd76da-c994-5830-adf0-467c66ae1b95"
    "MIG-29a6d43b-882e-5b79-868a-15bb2c770b82"
    "MIG-558debd2-dc13-5406-9256-73ef4f279737"
    "MIG-68d47441-f5fb-5a3b-ab5b-a0f3449c5b2a"
    "MIG-a78cd536-25b7-53cb-941a-a2db5eb0375f"
    "MIG-0a804c0c-aa27-5993-97cb-cedb854735ce"
    "MIG-8af9af6d-8720-5763-81e2-83afc43eb42b"
    "MIG-2ed972aa-6640-5116-9271-3d61c0dc9f3b"
    "MIG-3bdad788-60a5-5bef-85b9-fbdb36dc71fc"
    "MIG-ec043869-9176-577e-bac0-46c8411e4e37"
)

# Assign the UUID based on the CUDA index
SELECTED_UUID=${UUIDS[$CUDA_INDEX]}

# Print the selected UUID
echo $SELECTED_UUID
export CUDA_VISIBLE_DEVICES=$SELECTED_UUID
echo "CUDA_VISIBLE_DEVICES set to: "$SELECTED_UUID
echo "Test: "$CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=$SELECTED_UUID python3.8 gpu_script.py
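The lookup above only covers a single index. For requests such as gpu:H100_1g.10gb:2, where CUDA_VISIBLE_DEVICES holds a list like 0,1, the same idea could be extended roughly as follows. This is only a sketch of the workaround, not a fix for the underlying configuration; the function name indices_to_uuids is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: translate a comma-separated index list (as found in
# CUDA_VISIBLE_DEVICES) into the matching comma-separated UUID list.
# Usage: indices_to_uuids "0,1" "${UUIDS[@]}"
indices_to_uuids() {
    local indices=$1; shift
    local uuids=("$@") result=() idx
    local IFS=','
    for idx in $indices; do
        # Reject anything that is not a valid index into the UUID list.
        if ! [[ $idx =~ ^[0-9]+$ ]] || (( idx >= ${#uuids[@]} )); then
            echo "no UUID known for index '$idx'" >&2
            return 1
        fi
        result+=("${uuids[idx]}")
    done
    echo "${result[*]}"   # joined with ',' because IFS is ','
}

# Example with the first two 1g.10gb UUIDs from the list above.
indices_to_uuids "0,1" \
    "MIG-fbb82c87-989b-5e2c-98e5-a57919c5cb1a" \
    "MIG-d1679b4c-c8e7-5d78-bd63-07334ecad664"
```

In the job script this would be used as export CUDA_VISIBLE_DEVICES=$(indices_to_uuids "$CUDA_VISIBLE_DEVICES" "${UUIDS[@]}"), assuming the variable is already set in the batch environment so that the extra srun step is not needed.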
Result:
Script start
MIG-fbb82c87-989b-5e2c-98e5-a57919c5cb1a
CUDA_VISIBLE_DEVICES set to: MIG-fbb82c87-989b-5e2c-98e5-a57919c5cb1a
Matrix calculation on CUDA device completed successfully.
Script end
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0    7    0      28686      C   ...gpu-cuda12.1-python38/bin/python3.8      866MiB |
|    0    8    0      28744      C   ...gpu-cuda12.1-python38/bin/python3.8      866MiB |
|    0    9    0      28300      C   ...gpu-cuda12.1-python38/bin/python3.8      866MiB |
|    0   10    0      28512      C   ...gpu-cuda12.1-python38/bin/python3.8      866MiB |
|    0   11    0      28506      C   ...gpu-cuda12.1-python38/bin/python3.8      866MiB |
|    0   12    0      28516      C   ...gpu-cuda12.1-python38/bin/python3.8      866MiB |
|    0   13    0      28511      C   ...gpu-cuda12.1-python38/bin/python3.8      866MiB |
|    1    9    0      28552      C   ...gpu-cuda12.1-python38/bin/python3.8      866MiB |
+---------------------------------------------------------------------------------------+
This is the current setup:
RockyLinux 9.2
slurm-22.05.9-1.el9.x86_64
CUDA Version: 12.1
Driver Version: 530.30.02
# nvidia-smi -L
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-6c7f34eb-7b0b-9b09-4a2a-a009fb1ba6d3)
  MIG 1g.10gb Device 0: (UUID: MIG-fbb82c87-989b-5e2c-98e5-a57919c5cb1a)
  MIG 1g.10gb Device 1: (UUID: MIG-d1679b4c-c8e7-5d78-bd63-07334ecad664)
  MIG 1g.10gb Device 2: (UUID: MIG-d3ab0675-d318-5e53-b487-b50695cf2e00)
  MIG 1g.10gb Device 3: (UUID: MIG-c1384014-7937-5942-ba2a-900bd8a4c4b0)
  MIG 1g.10gb Device 4: (UUID: MIG-85dd76da-c994-5830-adf0-467c66ae1b95)
  MIG 1g.10gb Device 5: (UUID: MIG-29a6d43b-882e-5b79-868a-15bb2c770b82)
  MIG 1g.10gb Device 6: (UUID: MIG-558debd2-dc13-5406-9256-73ef4f279737)
GPU 1: NVIDIA H100 80GB HBM3 (UUID: GPU-1439a39c-948b-0657-98e9-aff8595a8729)
  MIG 2g.20gb Device 0: (UUID: MIG-a78cd536-25b7-53cb-941a-a2db5eb0375f)
  MIG 2g.20gb Device 1: (UUID: MIG-0a804c0c-aa27-5993-97cb-cedb854735ce)
  MIG 2g.20gb Device 2: (UUID: MIG-8af9af6d-8720-5763-81e2-83afc43eb42b)
  MIG 1g.10gb Device 3: (UUID: MIG-68d47441-f5fb-5a3b-ab5b-a0f3449c5b2a)
GPU 2: NVIDIA H100 80GB HBM3 (UUID: GPU-3b182fed-bfdf-bc11-d1bd-c41fd1468d2a)
  MIG 3g.40gb Device 0: (UUID: MIG-2ed972aa-6640-5116-9271-3d61c0dc9f3b)
  MIG 3g.40gb Device 1: (UUID: MIG-3bdad788-60a5-5bef-85b9-fbdb36dc71fc)
GPU 3: NVIDIA H100 80GB HBM3 (UUID: GPU-7eba9d78-e908-3db8-2633-a269aeec395e)
  MIG 7g.80gb Device 0: (UUID: MIG-ec043869-9176-577e-bac0-46c8411e4e37)
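Instead of hard-coding the UUID list in the job script, it could in principle be built from this listing. A minimal sketch, assuming the order in which nvidia-smi -L lists the devices of one profile matches the order of Slurm's indices for that type (this holds for the 1g.10gb list in my script, but I have not verified it in general); the function name mig_uuids_for_profile is made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `nvidia-smi -L` output on stdin and print the UUIDs
# of all MIG devices with the given profile, in the order they are listed.
mig_uuids_for_profile() {
    awk -v profile="$1" '
        $1 == "MIG" && $2 == profile {
            # The UUID is the only "MIG-..." token on the line.
            if (match($0, /MIG-[0-9a-f-]+/))
                print substr($0, RSTART, RLENGTH)
        }'
}

# Example on a shortened copy of the listing above; on the node itself this
# would be: nvidia-smi -L | mig_uuids_for_profile 1g.10gb
mig_uuids_for_profile 1g.10gb <<'EOF'
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-6c7f34eb-7b0b-9b09-4a2a-a009fb1ba6d3)
  MIG 1g.10gb Device 6: (UUID: MIG-558debd2-dc13-5406-9256-73ef4f279737)
GPU 1: NVIDIA H100 80GB HBM3 (UUID: GPU-1439a39c-948b-0657-98e9-aff8595a8729)
  MIG 2g.20gb Device 0: (UUID: MIG-a78cd536-25b7-53cb-941a-a2db5eb0375f)
  MIG 1g.10gb Device 3: (UUID: MIG-68d47441-f5fb-5a3b-ab5b-a0f3449c5b2a)
EOF
```

The example prints the two 1g.10gb UUIDs (one from GPU 0, one from GPU 1) and skips the 2g.20gb device between them.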
$ scontrol show node node16
NodeName=node16 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=1 CPUEfctv=112 CPUTot=112 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:H100_1g.10gb:8,gpu:H100_2g.20gb:3,gpu:H100_3g.40gb:2,gpu:H100_7g.80gb:1
   NodeAddr=node16 NodeHostName=node16 Version=22.05.9
   OS=Linux 5.14.0-284.30.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Sat Sep 16 09:55:41 UTC 2023
   RealMemory=1030000 AllocMem=0 FreeMem=1011801 Sockets=112 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=gpu_H100
   BootTime=2024-01-18T13:52:27 SlurmdStartTime=2024-01-19T11:25:44
   LastBusyTime=2024-01-19T10:59:13
   CfgTRES=cpu=112,mem=1030000M,billing=112,gres/gpu=14
   AllocTRES=cpu=1
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
$ scontrol show partition gpu_H100
PartitionName=gpu_H100
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=node16
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=112 TotalNodes=1 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
   TRES=cpu=112,mem=1030000M,node=1,billing=112,gres/gpu=14
slurm.conf:
AccountingStorageTRES=gres/gpu,gres/gpu:A100,gres/gpu:H100,gres/gpu:H100_1g.10gb,gres/gpu:H100_2g.20gb,gres/gpu:H100_3g.40gb,gres/gpu:H100_7g.80gb
GresTypes=gpu
...
NodeName=node16 CPUs=112 RealMemory=1030000 Gres=gpu:H100_1g.10gb:8,gpu:H100_2g.20gb:3,gpu:H100_3g.40gb:2,gpu:H100_7g.80gb:1
PartitionName=gpu_H100 Nodes=node16 Default=NO MaxTime=INFINITE State=UP
gres.conf:
#AutoDetect=nvml
# GPU 0 MIG 0 /proc/driver/nvidia/capabilities/gpu0/mig/gi7/access
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap66,/dev/nvidia-caps/nvidia-cap67 CPUs=0-55
# GPU 0 MIG 1 /proc/driver/nvidia/capabilities/gpu0/mig/gi8/access
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap75,/dev/nvidia-caps/nvidia-cap76 CPUs=0-55
# GPU 0 MIG 2 /proc/driver/nvidia/capabilities/gpu0/mig/gi9/access
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap84,/dev/nvidia-caps/nvidia-cap85 CPUs=0-55
# GPU 0 MIG 3 /proc/driver/nvidia/capabilities/gpu0/mig/gi10/access
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap93,/dev/nvidia-caps/nvidia-cap94 CPUs=0-55
# GPU 0 MIG 4 /proc/driver/nvidia/capabilities/gpu0/mig/gi11/access
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap102,/dev/nvidia-caps/nvidia-cap103 CPUs=0-55
# GPU 0 MIG 5 /proc/driver/nvidia/capabilities/gpu0/mig/gi12/access
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap111,/dev/nvidia-caps/nvidia-cap112 CPUs=0-55
# GPU 0 MIG 6 /proc/driver/nvidia/capabilities/gpu0/mig/gi13/access
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap120,/dev/nvidia-caps/nvidia-cap121 CPUs=0-55
# GPU 1 MIG 0 /proc/driver/nvidia/capabilities/gpu1/mig/gi3/access
NodeName=node16 Name=gpu Type=H100_2g.20gb MultipleFiles=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap165,/dev/nvidia-caps/nvidia-cap166 CPUs=0-55
# GPU 1 MIG 1 /proc/driver/nvidia/capabilities/gpu1/mig/gi5/access
NodeName=node16 Name=gpu Type=H100_2g.20gb MultipleFiles=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap183,/dev/nvidia-caps/nvidia-cap184 CPUs=0-55
# GPU 1 MIG 2 /proc/driver/nvidia/capabilities/gpu1/mig/gi6/access
NodeName=node16 Name=gpu Type=H100_2g.20gb MultipleFiles=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap192,/dev/nvidia-caps/nvidia-cap193 CPUs=0-55
# GPU 1 MIG 3 /proc/driver/nvidia/capabilities/gpu1/mig/gi9/access
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap219,/dev/nvidia-caps/nvidia-cap220 CPUs=0-55
# GPU 2 MIG 0 /proc/driver/nvidia/capabilities/gpu2/mig/gi1/access
NodeName=node16 Name=gpu Type=H100_3g.40gb MultipleFiles=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap282,/dev/nvidia-caps/nvidia-cap283 CPUs=0-55
# GPU 2 MIG 1 /proc/driver/nvidia/capabilities/gpu2/mig/gi2/access
NodeName=node16 Name=gpu Type=H100_3g.40gb MultipleFiles=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap291,/dev/nvidia-caps/nvidia-cap292 CPUs=0-55
# GPU 3 MIG 0 /proc/driver/nvidia/capabilities/gpu3/mig/gi0/access
NodeName=node16 Name=gpu Type=H100_7g.80gb MultipleFiles=/dev/nvidia3,/dev/nvidia-caps/nvidia-cap408,/dev/nvidia-caps/nvidia-cap409 CPUs=0-55
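Because the gres.conf entries are written by hand, a quick tally per Type= value helps to confirm that they add up to the counts in the slurm.conf Gres string (8, 3, 2 and 1 here). A small sketch; the function name count_gres_types is made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read gres.conf on stdin and print "<type>:<count>" for
# every Type= value, ignoring comment lines.
count_gres_types() {
    grep -v '^[[:space:]]*#' | grep -o 'Type=[^[:space:]]*' | sort | uniq -c |
        awk '{ sub(/^Type=/, "", $2); print $2 ":" $1 }'
}

# Example on three entries copied from the gres.conf above; on the node itself
# this would be: count_gres_types < gres.conf
count_gres_types <<'EOF'
#AutoDetect=nvml
# GPU 0 MIG 0 /proc/driver/nvidia/capabilities/gpu0/mig/gi7/access
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap66,/dev/nvidia-caps/nvidia-cap67 CPUs=0-55
NodeName=node16 Name=gpu Type=H100_2g.20gb MultipleFiles=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap165,/dev/nvidia-caps/nvidia-cap166 CPUs=0-55
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap219,/dev/nvidia-caps/nvidia-cap220 CPUs=0-55
EOF
```

The output is sorted by type name, so it does not show the order in which the entries appear in the file.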
[2024-01-19T14:31:13.701] debug: gres/gpu: init: loaded
[2024-01-19T14:31:13.701] debug: gpu/generic: init: init: GPU Generic plugin loaded
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_1g.10gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_1g.10gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_1g.10gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_1g.10gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_1g.10gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_1g.10gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_1g.10gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_1g.10gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_2g.20gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_2g.20gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_2g.20gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_3g.40gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_3g.40gb Count=1
[2024-01-19T14:31:13.702] Gres Name=gpu Type=H100_7g.80gb Count=1
Best,
Drazen Jalsovec