<div dir="ltr">Hi,<div>We are testing the MIG deployment on our new slurm compute node with 4 x H100 GPUs. It looks like everything is configured correctly but we have a problem accessing mig devices. When I submit jobs requesting a mig gpu device #SBATCH --gres=gpu:H100_1g.10gb:1, the jobs get submitted to the node, but only 4 jobs get executed and all other jobs fail. I was able to "solve it" by adding code to submit to the mig device UUID. Am I missing something in the slurm/gres/cgroups configuration that would automatically assign correct CUDA_VISIBLE_DEVICES MIG UUID to the job?</div><div><br></div><div>If I submit 10 jobs requesting --gres=gpu:H100_1g.10gb:1<br><br>8 jobs start running:</div><div>#1 - CUDA_VISIBLE_DEVICES=0 - jobs 1-4 run on one of the mig devices configured on GPU-device<br>...<br>#8 - CUDA_VISIBLE_DEVICES=7 - jobs 5-8 fail with the message that there are no available CUDA_VISIBLE_DEVICES<br>...</div><div>#10 - CUDA_VISIBLE_DEVICES=? - after waiting for free mig resources jobs 9-10 runs or fails depending if GPU-device is free (migs are free)<br><br>$ nvidia-smi --query-gpu=uuid --id=0 --format=csv,noheader<br>GPU-6c7f34eb-7b0b-9b09-4a2a-a009fb1ba6d3<br>$ nvidia-smi --query-gpu=uuid --id=7 --format=csv,noheader<br>No devices were found<br></div><div><br></div><div>The slurm assigns CUDA_VISIBLE_DEVICES numbers based on the requested resources:<br><br>Request:</div><div>gpu:H100_1g.10gb:1 - CUDA_VISIBLE_DEVICES=0</div><div>gpu:H100_1g.10gb:2 - CUDA_VISIBLE_DEVICES=0,1</div><div>...</div><div>Any job with H100_1g.10gb request is going to get device number between 0 and 7</div><div>gpu:H100_1g.10gb:8 - CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7<br></div><div><br></div><div>gpu:H100_2g.20gb:1 - CUDA_VISIBLE_DEVICES=8</div><div>Any job with H100_2g.20gb request is going to get device number between 8 and 10<br></div><div>...</div><div>gpu:H100_2g.10gb:3 - CUDA_VISIBLE_DEVICES=8,9,10</div><div>gpu:H100_3g.40gb:1 - CUDA_VISIBLE_DEVICES=11</div><div>gpu:H100_3g.40gb:2 - CUDA_VISIBLE_DEVICES=12</div><div>gpu:H100_7g.80gb:1 - CUDA_VISIBLE_DEVICES=13<br><br>I was able to "solve it" by adding this code to my slurm script:<br><br># Get the CUDA index using srun and awk<br>CUDA_INDEX=$(srun env | grep CUDA_VISIBLE_DEVICES | awk -F '=' '{print $2}')<br><br>## GPU0 1g.10gb<br>## MIG-fbb82c87-989b-5e2c-98e5-a57919c5cb1a<br>## MIG-d1679b4c-c8e7-5d78-bd63-07334ecad664<br>## MIG-d3ab0675-d318-5e53-b487-b50695cf2e00<br>## MIG-c1384014-7937-5942-ba2a-900bd8a4c4b0<br>## MIG-85dd76da-c994-5830-adf0-467c66ae1b95<br>## MIG-29a6d43b-882e-5b79-868a-15bb2c770b82<br>## MIG-558debd2-dc13-5406-9256-73ef4f279737<br>## GPU1 2g.20gb<br>## MIG-a78cd536-25b7-53cb-941a-a2db5eb0375f<br>## MIG-0a804c0c-aa27-5993-97cb-cedb854735ce<br>## MIG-8af9af6d-8720-5763-81e2-83afc43eb42b<br>## GPU1 1g.10gb<br>## MIG-68d47441-f5fb-5a3b-ab5b-a0f3449c5b2a<br>## GPU2 3g.40gb<br>## MIG-2ed972aa-6640-5116-9271-3d61c0dc9f3b<br>## MIG-3bdad788-60a5-5bef-85b9-fbdb36dc71fc<br>## GPU3 7g.80gb<br>## MIG-ec043869-9176-577e-bac0-46c8411e4e37<br><br># Define the list of UUIDs<br>UUIDS=(<br>  "MIG-fbb82c87-989b-5e2c-98e5-a57919c5cb1a"<br>  "MIG-d1679b4c-c8e7-5d78-bd63-07334ecad664"<br>  "MIG-d3ab0675-d318-5e53-b487-b50695cf2e00"<br>  "MIG-c1384014-7937-5942-ba2a-900bd8a4c4b0"<br>  "MIG-85dd76da-c994-5830-adf0-467c66ae1b95"<br>  "MIG-29a6d43b-882e-5b79-868a-15bb2c770b82"<br>  "MIG-558debd2-dc13-5406-9256-73ef4f279737"<br>  "MIG-68d47441-f5fb-5a3b-ab5b-a0f3449c5b2a"<br>  "MIG-a78cd536-25b7-53cb-941a-a2db5eb0375f"<br>  
"MIG-0a804c0c-aa27-5993-97cb-cedb854735ce"<br>  "MIG-8af9af6d-8720-5763-81e2-83afc43eb42b"<br>  "MIG-2ed972aa-6640-5116-9271-3d61c0dc9f3b"<br>  "MIG-3bdad788-60a5-5bef-85b9-fbdb36dc71fc"<br>  "MIG-ec043869-9176-577e-bac0-46c8411e4e37"<br>)<br><br># Assign the UUID based on the CUDA index<br>SELECTED_UUID=${UUIDS[$CUDA_INDEX]}<br><br># Print the selected UUID<br>echo $SELECTED_UUID<br><br>export CUDA_VISIBLE_DEVICES=$SELECTED_UUID<br><br>echo "CUDA_VISIBLE_DEVICES set to: "$SELECTED_UUID<br><br>echo "Test: "$CUDA_VISIBLE_DEVICES<br><br>CUDA_VISIBLE_DEVICES=$SELECTED_UUID python3.8 gpu_script.py<br><br>Result:</div><div><br>Script start<br>MIG-fbb82c87-989b-5e2c-98e5-a57919c5cb1a<br>CUDA_VISIBLE_DEVICES set to: MIG-fbb82c87-989b-5e2c-98e5-a57919c5cb1a<br>Matrix calculation on CUDA device completed successfully.<br>Script end<br></div><div><br>+---------------------------------------------------------------------------------------+<br>| Processes:                                                                            |<br>|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |<br>|        ID   ID                                                             Usage      |<br>|=======================================================================================|<br>|    0    7    0      28686      C   ...gpu-cuda12.1-python38/bin/python3.8      866MiB |<br>|    0    8    0      28744      C   ...gpu-cuda12.1-python38/bin/python3.8      866MiB |<br>|    0    9    0      28300      C   ...gpu-cuda12.1-python38/bin/python3.8      866MiB |<br>|    0   10    0      28512      C   ...gpu-cuda12.1-python38/bin/python3.8      866MiB |<br>|    0   11    0      28506      C   ...gpu-cuda12.1-python38/bin/python3.8      866MiB |<br>|    0   12    0      28516      C   ...gpu-cuda12.1-python38/bin/python3.8      866MiB |<br>|    0   13    0      28511      C   ...gpu-cuda12.1-python38/bin/python3.8      866MiB |<br>|    1    9    0      28552      C   ...gpu-cuda12.1-python38/bin/python3.8      866MiB |<br>+---------------------------------------------------------------------------------------+<br><br>This is the current setup:<br><br>RockyLinux 9.2<br>slurm-22.05.9-1.el9.x86_64<br>CUDA Version: 12.1 Driver Version: 530.30.02<br><br># nvidia-smi -L<br>GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-6c7f34eb-7b0b-9b09-4a2a-a009fb1ba6d3)<br>  MIG 1g.10gb     Device  0: (UUID: MIG-fbb82c87-989b-5e2c-98e5-a57919c5cb1a)<br>  MIG 1g.10gb     Device  1: (UUID: MIG-d1679b4c-c8e7-5d78-bd63-07334ecad664)<br>  MIG 1g.10gb     Device  2: (UUID: MIG-d3ab0675-d318-5e53-b487-b50695cf2e00)<br>  MIG 1g.10gb     Device  3: (UUID: MIG-c1384014-7937-5942-ba2a-900bd8a4c4b0)<br>  MIG 1g.10gb     Device  4: (UUID: MIG-85dd76da-c994-5830-adf0-467c66ae1b95)<br>  MIG 1g.10gb     Device  5: (UUID: MIG-29a6d43b-882e-5b79-868a-15bb2c770b82)<br>  MIG 1g.10gb     Device  6: (UUID: MIG-558debd2-dc13-5406-9256-73ef4f279737)<br>GPU 1: NVIDIA H100 80GB HBM3 (UUID: GPU-1439a39c-948b-0657-98e9-aff8595a8729)<br>  MIG 2g.20gb     Device  0: (UUID: MIG-a78cd536-25b7-53cb-941a-a2db5eb0375f)<br>  MIG 2g.20gb     Device  1: (UUID: MIG-0a804c0c-aa27-5993-97cb-cedb854735ce)<br>  MIG 2g.20gb     Device  2: (UUID: MIG-8af9af6d-8720-5763-81e2-83afc43eb42b)<br>  MIG 1g.10gb     Device  3: (UUID: MIG-68d47441-f5fb-5a3b-ab5b-a0f3449c5b2a)<br>GPU 2: NVIDIA H100 80GB HBM3 (UUID: GPU-3b182fed-bfdf-bc11-d1bd-c41fd1468d2a)<br>  MIG 3g.40gb     Device  0: (UUID: MIG-2ed972aa-6640-5116-9271-3d61c0dc9f3b)<br>  MIG 3g.40gb     Device  
$ scontrol show node node16
NodeName=node16 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=1 CPUEfctv=112 CPUTot=112 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:H100_1g.10gb:8,gpu:H100_2g.20gb:3,gpu:H100_3g.40gb:2,gpu:H100_7g.80gb:1
   NodeAddr=node16 NodeHostName=node16 Version=22.05.9
   OS=Linux 5.14.0-284.30.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Sat Sep 16 09:55:41 UTC 2023
   RealMemory=1030000 AllocMem=0 FreeMem=1011801 Sockets=112 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=gpu_H100
   BootTime=2024-01-18T13:52:27 SlurmdStartTime=2024-01-19T11:25:44
   LastBusyTime=2024-01-19T10:59:13
   CfgTRES=cpu=112,mem=1030000M,billing=112,gres/gpu=14
   AllocTRES=cpu=1
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

$ scontrol show partition gpu_H100
PartitionName=gpu_H100
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=node16
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=112 TotalNodes=1 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
   TRES=cpu=112,mem=1030000M,node=1,billing=112,gres/gpu=14

slurm.conf:

AccountingStorageTRES=gres/gpu,gres/gpu:A100,gres/gpu:H100,gres/gpu:H100_1g.10gb,gres/gpu:H100_2g.20gb,gres/gpu:H100_3g.40gb,gres/gpu:H100_7g.80gb
GresTypes=gpu
...
NodeName=node16 CPUs=112 RealMemory=1030000 Gres=gpu:H100_1g.10gb:8,gpu:H100_2g.20gb:3,gpu:H100_3g.40gb:2,gpu:H100_7g.80gb:1
PartitionName=gpu_H100 Nodes=node16 Default=NO MaxTime=INFINITE State=UP

gres.conf:

#AutoDetect=nvml
# GPU 0 MIG 0 /proc/driver/nvidia/capabilities/gpu0/mig/gi7/access
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap66,/dev/nvidia-caps/nvidia-cap67 CPUs=0-55
# GPU 0 MIG 1 /proc/driver/nvidia/capabilities/gpu0/mig/gi8/access
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap75,/dev/nvidia-caps/nvidia-cap76 CPUs=0-55
# GPU 0 MIG 2 /proc/driver/nvidia/capabilities/gpu0/mig/gi9/access
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap84,/dev/nvidia-caps/nvidia-cap85 CPUs=0-55
# GPU 0 MIG 3 /proc/driver/nvidia/capabilities/gpu0/mig/gi10/access
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap93,/dev/nvidia-caps/nvidia-cap94 CPUs=0-55
# GPU 0 MIG 4 /proc/driver/nvidia/capabilities/gpu0/mig/gi11/access
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap102,/dev/nvidia-caps/nvidia-cap103 CPUs=0-55
# GPU 0 MIG 5 /proc/driver/nvidia/capabilities/gpu0/mig/gi12/access
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap111,/dev/nvidia-caps/nvidia-cap112 CPUs=0-55
# GPU 0 MIG 6 /proc/driver/nvidia/capabilities/gpu0/mig/gi13/access
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap120,/dev/nvidia-caps/nvidia-cap121 CPUs=0-55
# GPU 1 MIG 0 /proc/driver/nvidia/capabilities/gpu1/mig/gi3/access
NodeName=node16 Name=gpu Type=H100_2g.20gb MultipleFiles=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap165,/dev/nvidia-caps/nvidia-cap166 CPUs=0-55
# GPU 1 MIG 1 /proc/driver/nvidia/capabilities/gpu1/mig/gi5/access
NodeName=node16 Name=gpu Type=H100_2g.20gb MultipleFiles=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap183,/dev/nvidia-caps/nvidia-cap184 CPUs=0-55
# GPU 1 MIG 2 /proc/driver/nvidia/capabilities/gpu1/mig/gi6/access
NodeName=node16 Name=gpu Type=H100_2g.20gb MultipleFiles=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap192,/dev/nvidia-caps/nvidia-cap193 CPUs=0-55
# GPU 1 MIG 3 /proc/driver/nvidia/capabilities/gpu1/mig/gi9/access
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap219,/dev/nvidia-caps/nvidia-cap220 CPUs=0-55
# GPU 2 MIG 0 /proc/driver/nvidia/capabilities/gpu2/mig/gi1/access
NodeName=node16 Name=gpu Type=H100_3g.40gb MultipleFiles=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap282,/dev/nvidia-caps/nvidia-cap283 CPUs=0-55
# GPU 2 MIG 1 /proc/driver/nvidia/capabilities/gpu2/mig/gi2/access
NodeName=node16 Name=gpu Type=H100_3g.40gb MultipleFiles=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap291,/dev/nvidia-caps/nvidia-cap292 CPUs=0-55
# GPU 3 MIG 0 /proc/driver/nvidia/capabilities/gpu3/mig/gi0/access
NodeName=node16 Name=gpu Type=H100_7g.80gb MultipleFiles=/dev/nvidia3,/dev/nvidia-caps/nvidia-cap408,/dev/nvidia-caps/nvidia-cap409 CPUs=0-55
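Is NVML autodetection the piece I am missing? The AutoDetect=nvml line at the top of gres.conf is commented out; as I understand it, with slurmd built against NVML, something like the following should let Slurm discover the MIG instances itself and export the MIG UUIDs instead of plain indices (untested on our side; our slurmd currently loads the generic GPU plugin, see the log below):

# gres.conf - autodetection variant (assumes slurmd is built with NVML)
AutoDetect=nvml
# (the Gres=... counts in slurm.conf must still match what NVML detects)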
From the slurmd log:

[2024-01-19T14:31:13.701] debug:  gres/gpu: init: loaded
[2024-01-19T14:31:13.701] debug:  gpu/generic: init: init: GPU Generic plugin loaded
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_1g.10gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_1g.10gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_1g.10gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_1g.10gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_1g.10gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_1g.10gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_1g.10gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_1g.10gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_2g.20gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_2g.20gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_2g.20gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_3g.40gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_3g.40gb Count=1
[2024-01-19T14:31:13.702] Gres Name=gpu Type=H100_7g.80gb Count=1

Best,
Drazen Jalsovec