[slurm-users] MIG-Slice: Unavailable GRES

Dražen Jalšovec drazen.jalsovec at gmail.com
Mon Jan 22 03:10:03 UTC 2024


Hi,
We are testing a MIG deployment on our new Slurm compute node with 4 x
H100 GPUs. Everything appears to be configured correctly, but we have a
problem accessing the MIG devices. When I submit jobs requesting a MIG GPU
device with #SBATCH --gres=gpu:H100_1g.10gb:1, the jobs are dispatched to
the node, but only 4 of them run successfully and all the others fail. I
was able to "solve it" by adding code to the job script that sets
CUDA_VISIBLE_DEVICES to the MIG device UUID. Am I missing something in the
Slurm/gres/cgroup configuration that would automatically assign the
correct MIG UUID in CUDA_VISIBLE_DEVICES to each job?

If I submit 10 jobs requesting --gres=gpu:H100_1g.10gb:1, 8 jobs start
running:

#1 - CUDA_VISIBLE_DEVICES=0 - jobs 1-4 run on one of the MIG devices
configured on the GPU device
...
#8 - CUDA_VISIBLE_DEVICES=7 - jobs 5-8 fail with the message that no
CUDA_VISIBLE_DEVICES are available
...
#10 - CUDA_VISIBLE_DEVICES=? - after waiting for free MIG resources, jobs
9-10 run or fail depending on whether the GPU device is free (its MIG
devices are free)

$ nvidia-smi --query-gpu=uuid --id=0 --format=csv,noheader
GPU-6c7f34eb-7b0b-9b09-4a2a-a009fb1ba6d3
$ nvidia-smi --query-gpu=uuid --id=7 --format=csv,noheader
No devices were found
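
One possible reading of the two queries above is that index 7 lies outside the range of the node's four physical GPUs (0-3), so it cannot be resolved when treated as a plain GPU index. A minimal sketch of that check follows; the function name out_of_range_indices is made up for illustration, and the GPU count is passed in by hand rather than detected.

```sh
#!/bin/sh
# out_of_range_indices CVD GPU_COUNT
# Print every numeric entry of a comma-separated CUDA_VISIBLE_DEVICES
# value (CVD) that is not a valid index for GPU_COUNT physical GPUs.
# Hypothetical helper: it only compares numbers and does not query the driver.
out_of_range_indices() {
  printf '%s\n' "$1" | tr ',' '\n' |
    awk -v n="$2" '/^[0-9]+$/ && $0 + 0 >= n + 0'
}

# Example with the values from this post: 4 physical GPUs, indices 0, 3 and 7.
out_of_range_indices "0,3,7" 4
```

Inside a job one could pass "$CUDA_VISIBLE_DEVICES" as the first argument to see which of the assigned numbers cannot be a physical GPU index.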

Slurm assigns the CUDA_VISIBLE_DEVICES numbers based on the requested
resources:

Request:
gpu:H100_1g.10gb:1 - CUDA_VISIBLE_DEVICES=0
gpu:H100_1g.10gb:2 - CUDA_VISIBLE_DEVICES=0,1
...
Any job with an H100_1g.10gb request gets a device number between 0
and 7
gpu:H100_1g.10gb:8 - CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

gpu:H100_2g.20gb:1 - CUDA_VISIBLE_DEVICES=8
Any job with an H100_2g.20gb request gets a device number between 8
and 10
...
gpu:H100_2g.20gb:3 - CUDA_VISIBLE_DEVICES=8,9,10
gpu:H100_3g.40gb:1 - CUDA_VISIBLE_DEVICES=11
gpu:H100_3g.40gb:2 - CUDA_VISIBLE_DEVICES=12
gpu:H100_7g.80gb:1 - CUDA_VISIBLE_DEVICES=13
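
The numbering described above looks like a running counter over the typed entries of the node's Gres= string, in the order they are listed. A minimal sketch of that reading follows; the function name gres_index_ranges is made up for illustration, and the consecutive-numbering rule is inferred from the values reported in this post, not taken from the Slurm documentation.

```sh
#!/bin/sh
# gres_index_ranges GRES
# For a Gres= string such as "gpu:TYPE:COUNT,gpu:TYPE:COUNT,...", print one
# line per entry: "TYPE FIRST-LAST", numbering devices consecutively in the
# order the entries are listed. Entries without a type are skipped.
# Hypothetical helper; the numbering rule is an assumption based on the post.
gres_index_ranges() {
  printf '%s\n' "$1" | awk -F, '{
    next_index = 0
    for (i = 1; i <= NF; i++) {
      n = split($i, part, ":")   # part[1]=name, part[2]=type, part[3]=count
      if (n != 3) continue
      first = next_index
      next_index += part[3]
      printf "%s %d-%d\n", part[2], first, next_index - 1
    }
  }'
}

# Gres string taken from the node definition in this post.
gres_index_ranges "gpu:H100_1g.10gb:8,gpu:H100_2g.20gb:3,gpu:H100_3g.40gb:2,gpu:H100_7g.80gb:1"
```

For the node in this post the ranges come out as 0-7, 8-10, 11-12 and 13, which agrees with the device numbers listed above.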

I was able to "solve it" by adding this code to my slurm script:

# Get the CUDA index using srun and awk
CUDA_INDEX=$(srun env | grep CUDA_VISIBLE_DEVICES | awk -F '=' '{print $2}')

## GPU0 1g.10gb
## MIG-fbb82c87-989b-5e2c-98e5-a57919c5cb1a
## MIG-d1679b4c-c8e7-5d78-bd63-07334ecad664
## MIG-d3ab0675-d318-5e53-b487-b50695cf2e00
## MIG-c1384014-7937-5942-ba2a-900bd8a4c4b0
## MIG-85dd76da-c994-5830-adf0-467c66ae1b95
## MIG-29a6d43b-882e-5b79-868a-15bb2c770b82
## MIG-558debd2-dc13-5406-9256-73ef4f279737
## GPU1 2g.20gb
## MIG-a78cd536-25b7-53cb-941a-a2db5eb0375f
## MIG-0a804c0c-aa27-5993-97cb-cedb854735ce
## MIG-8af9af6d-8720-5763-81e2-83afc43eb42b
## GPU1 1g.10gb
## MIG-68d47441-f5fb-5a3b-ab5b-a0f3449c5b2a
## GPU2 3g.40gb
## MIG-2ed972aa-6640-5116-9271-3d61c0dc9f3b
## MIG-3bdad788-60a5-5bef-85b9-fbdb36dc71fc
## GPU3 7g.80gb
## MIG-ec043869-9176-577e-bac0-46c8411e4e37

# Define the list of UUIDs
UUIDS=(
  "MIG-fbb82c87-989b-5e2c-98e5-a57919c5cb1a"
  "MIG-d1679b4c-c8e7-5d78-bd63-07334ecad664"
  "MIG-d3ab0675-d318-5e53-b487-b50695cf2e00"
  "MIG-c1384014-7937-5942-ba2a-900bd8a4c4b0"
  "MIG-85dd76da-c994-5830-adf0-467c66ae1b95"
  "MIG-29a6d43b-882e-5b79-868a-15bb2c770b82"
  "MIG-558debd2-dc13-5406-9256-73ef4f279737"
  "MIG-68d47441-f5fb-5a3b-ab5b-a0f3449c5b2a"
  "MIG-a78cd536-25b7-53cb-941a-a2db5eb0375f"
  "MIG-0a804c0c-aa27-5993-97cb-cedb854735ce"
  "MIG-8af9af6d-8720-5763-81e2-83afc43eb42b"
  "MIG-2ed972aa-6640-5116-9271-3d61c0dc9f3b"
  "MIG-3bdad788-60a5-5bef-85b9-fbdb36dc71fc"
  "MIG-ec043869-9176-577e-bac0-46c8411e4e37"
)

# Assign the UUID based on the CUDA index
SELECTED_UUID=${UUIDS[$CUDA_INDEX]}

# Print the selected UUID
echo "$SELECTED_UUID"

export CUDA_VISIBLE_DEVICES="$SELECTED_UUID"

echo "CUDA_VISIBLE_DEVICES set to: $SELECTED_UUID"

echo "Test: $CUDA_VISIBLE_DEVICES"

CUDA_VISIBLE_DEVICES="$SELECTED_UUID" python3.8 gpu_script.py
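
The hard-coded UUID list in the workaround above could in principle be built from `nvidia-smi -L` output instead. The sketch below is one way to do that, under two assumptions: each MIG line has the form "MIG <profile> Device <n>: (UUID: MIG-...)" on a single line, and the device numbers follow the profile order of the node's Gres= definition. The names list_mig_uuids and pick_uuids are made up for illustration. Unlike the array lookup above, it also accepts a comma-separated list of indices.

```sh
#!/bin/sh
# list_mig_uuids PROFILE...
# Read `nvidia-smi -L` output on stdin and print the MIG UUIDs, one per line,
# grouped by profile in the order the profiles are given.
# Hypothetical helper; the grouping order is an assumption based on this post.
list_mig_uuids() (
  listing=$(cat)
  for profile in "$@"; do
    printf '%s\n' "$listing" |
      awk -v p="$profile" '$1 == "MIG" && $2 == p {
        gsub(/[()]/, "", $NF)   # strip the parentheses around the UUID
        print $NF
      }'
  done
)

# pick_uuids INDICES
# Read one UUID per line on stdin and print the UUIDs at the comma-separated,
# zero-based INDICES, joined by commas. Fails if an index has no UUID.
pick_uuids() {
  awk -v want="$1" '
    { uuid[NR - 1] = $0 }
    END {
      n = split(want, idx, ",")
      for (k = 1; k <= n; k++) {
        if (!(idx[k] in uuid)) {
          print "no MIG device at index " idx[k] > "/dev/stderr"
          exit 1
        }
        out = out (k > 1 ? "," : "") uuid[idx[k]]
      }
      print out
    }'
}

# Possible use inside a job script (left commented out, needs a GPU node):
#   CUDA_VISIBLE_DEVICES=$(nvidia-smi -L |
#     list_mig_uuids 1g.10gb 2g.20gb 3g.40gb 7g.80gb |
#     pick_uuids "$CUDA_VISIBLE_DEVICES")
#   export CUDA_VISIBLE_DEVICES
```

Grouping by profile matters here because the 1g.10gb instance on GPU 1 is listed after the 2g.20gb instances by nvidia-smi but appears at index 7 in the UUID list above.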

Result:

Script start
MIG-fbb82c87-989b-5e2c-98e5-a57919c5cb1a
CUDA_VISIBLE_DEVICES set to: MIG-fbb82c87-989b-5e2c-98e5-a57919c5cb1a
Matrix calculation on CUDA device completed successfully.
Script end

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0    7    0      28686      C   ...gpu-cuda12.1-python38/bin/python3.8      866MiB |
|    0    8    0      28744      C   ...gpu-cuda12.1-python38/bin/python3.8      866MiB |
|    0    9    0      28300      C   ...gpu-cuda12.1-python38/bin/python3.8      866MiB |
|    0   10    0      28512      C   ...gpu-cuda12.1-python38/bin/python3.8      866MiB |
|    0   11    0      28506      C   ...gpu-cuda12.1-python38/bin/python3.8      866MiB |
|    0   12    0      28516      C   ...gpu-cuda12.1-python38/bin/python3.8      866MiB |
|    0   13    0      28511      C   ...gpu-cuda12.1-python38/bin/python3.8      866MiB |
|    1    9    0      28552      C   ...gpu-cuda12.1-python38/bin/python3.8      866MiB |
+---------------------------------------------------------------------------------------+

This is the current setup:

RockyLinux 9.2
slurm-22.05.9-1.el9.x86_64
CUDA Version: 12.1 Driver Version: 530.30.02

# nvidia-smi -L
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-6c7f34eb-7b0b-9b09-4a2a-a009fb1ba6d3)
  MIG 1g.10gb     Device  0: (UUID: MIG-fbb82c87-989b-5e2c-98e5-a57919c5cb1a)
  MIG 1g.10gb     Device  1: (UUID: MIG-d1679b4c-c8e7-5d78-bd63-07334ecad664)
  MIG 1g.10gb     Device  2: (UUID: MIG-d3ab0675-d318-5e53-b487-b50695cf2e00)
  MIG 1g.10gb     Device  3: (UUID: MIG-c1384014-7937-5942-ba2a-900bd8a4c4b0)
  MIG 1g.10gb     Device  4: (UUID: MIG-85dd76da-c994-5830-adf0-467c66ae1b95)
  MIG 1g.10gb     Device  5: (UUID: MIG-29a6d43b-882e-5b79-868a-15bb2c770b82)
  MIG 1g.10gb     Device  6: (UUID: MIG-558debd2-dc13-5406-9256-73ef4f279737)
GPU 1: NVIDIA H100 80GB HBM3 (UUID: GPU-1439a39c-948b-0657-98e9-aff8595a8729)
  MIG 2g.20gb     Device  0: (UUID: MIG-a78cd536-25b7-53cb-941a-a2db5eb0375f)
  MIG 2g.20gb     Device  1: (UUID: MIG-0a804c0c-aa27-5993-97cb-cedb854735ce)
  MIG 2g.20gb     Device  2: (UUID: MIG-8af9af6d-8720-5763-81e2-83afc43eb42b)
  MIG 1g.10gb     Device  3: (UUID: MIG-68d47441-f5fb-5a3b-ab5b-a0f3449c5b2a)
GPU 2: NVIDIA H100 80GB HBM3 (UUID: GPU-3b182fed-bfdf-bc11-d1bd-c41fd1468d2a)
  MIG 3g.40gb     Device  0: (UUID: MIG-2ed972aa-6640-5116-9271-3d61c0dc9f3b)
  MIG 3g.40gb     Device  1: (UUID: MIG-3bdad788-60a5-5bef-85b9-fbdb36dc71fc)
GPU 3: NVIDIA H100 80GB HBM3 (UUID: GPU-7eba9d78-e908-3db8-2633-a269aeec395e)
  MIG 7g.80gb     Device  0: (UUID: MIG-ec043869-9176-577e-bac0-46c8411e4e37)

$ scontrol show node node16
NodeName=node16 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=1 CPUEfctv=112 CPUTot=112 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:H100_1g.10gb:8,gpu:H100_2g.20gb:3,gpu:H100_3g.40gb:2,gpu:H100_7g.80gb:1
   NodeAddr=node16 NodeHostName=node16 Version=22.05.9
   OS=Linux 5.14.0-284.30.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Sat Sep 16 09:55:41 UTC 2023
   RealMemory=1030000 AllocMem=0 FreeMem=1011801 Sockets=112 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=gpu_H100
   BootTime=2024-01-18T13:52:27 SlurmdStartTime=2024-01-19T11:25:44
   LastBusyTime=2024-01-19T10:59:13
   CfgTRES=cpu=112,mem=1030000M,billing=112,gres/gpu=14
   AllocTRES=cpu=1
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

$ scontrol show partition gpu_H100
PartitionName=gpu_H100
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=node16
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=112 TotalNodes=1 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
   TRES=cpu=112,mem=1030000M,node=1,billing=112,gres/gpu=14

slurm.conf:

AccountingStorageTRES=gres/gpu,gres/gpu:A100,gres/gpu:H100,gres/gpu:H100_1g.10gb,gres/gpu:H100_2g.20gb,gres/gpu:H100_3g.40gb,gres/gpu:H100_7g.80gb
GresTypes=gpu
...
NodeName=node16 CPUs=112 RealMemory=1030000 Gres=gpu:H100_1g.10gb:8,gpu:H100_2g.20gb:3,gpu:H100_3g.40gb:2,gpu:H100_7g.80gb:1
PartitionName=gpu_H100 Nodes=node16 Default=NO MaxTime=INFINITE State=UP

gres.conf:

#AutoDetect=nvml
# GPU 0 MIG 0 /proc/driver/nvidia/capabilities/gpu0/mig/gi7/access
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap66,/dev/nvidia-caps/nvidia-cap67 CPUs=0-55
# GPU 0 MIG 1 /proc/driver/nvidia/capabilities/gpu0/mig/gi8/access
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap75,/dev/nvidia-caps/nvidia-cap76 CPUs=0-55
# GPU 0 MIG 2 /proc/driver/nvidia/capabilities/gpu0/mig/gi9/access
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap84,/dev/nvidia-caps/nvidia-cap85 CPUs=0-55
# GPU 0 MIG 3 /proc/driver/nvidia/capabilities/gpu0/mig/gi10/access
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap93,/dev/nvidia-caps/nvidia-cap94 CPUs=0-55
# GPU 0 MIG 4 /proc/driver/nvidia/capabilities/gpu0/mig/gi11/access
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap102,/dev/nvidia-caps/nvidia-cap103 CPUs=0-55
# GPU 0 MIG 5 /proc/driver/nvidia/capabilities/gpu0/mig/gi12/access
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap111,/dev/nvidia-caps/nvidia-cap112 CPUs=0-55
# GPU 0 MIG 6 /proc/driver/nvidia/capabilities/gpu0/mig/gi13/access
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap120,/dev/nvidia-caps/nvidia-cap121 CPUs=0-55
# GPU 1 MIG 0 /proc/driver/nvidia/capabilities/gpu1/mig/gi3/access
NodeName=node16 Name=gpu Type=H100_2g.20gb MultipleFiles=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap165,/dev/nvidia-caps/nvidia-cap166 CPUs=0-55
# GPU 1 MIG 1 /proc/driver/nvidia/capabilities/gpu1/mig/gi5/access
NodeName=node16 Name=gpu Type=H100_2g.20gb MultipleFiles=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap183,/dev/nvidia-caps/nvidia-cap184 CPUs=0-55
# GPU 1 MIG 2 /proc/driver/nvidia/capabilities/gpu1/mig/gi6/access
NodeName=node16 Name=gpu Type=H100_2g.20gb MultipleFiles=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap192,/dev/nvidia-caps/nvidia-cap193 CPUs=0-55
# GPU 1 MIG 3 /proc/driver/nvidia/capabilities/gpu1/mig/gi9/access
NodeName=node16 Name=gpu Type=H100_1g.10gb MultipleFiles=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap219,/dev/nvidia-caps/nvidia-cap220 CPUs=0-55
# GPU 2 MIG 0 /proc/driver/nvidia/capabilities/gpu2/mig/gi1/access
NodeName=node16 Name=gpu Type=H100_3g.40gb MultipleFiles=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap282,/dev/nvidia-caps/nvidia-cap283 CPUs=0-55
# GPU 2 MIG 1 /proc/driver/nvidia/capabilities/gpu2/mig/gi2/access
NodeName=node16 Name=gpu Type=H100_3g.40gb MultipleFiles=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap291,/dev/nvidia-caps/nvidia-cap292 CPUs=0-55
# GPU 3 MIG 0 /proc/driver/nvidia/capabilities/gpu3/mig/gi0/access
NodeName=node16 Name=gpu Type=H100_7g.80gb MultipleFiles=/dev/nvidia3,/dev/nvidia-caps/nvidia-cap408,/dev/nvidia-caps/nvidia-cap409 CPUs=0-55
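
The comments in the gres.conf above pair each MIG instance's "access" file with the nvidia-cap device files listed in MultipleFiles. A minimal sketch of how such a cap path might be derived from an access file follows. It assumes the file contains a line of the form "DeviceFileMinor: <number>", as shown in NVIDIA's MIG documentation, and the function name cap_device_path is made up for illustration.

```sh
#!/bin/sh
# cap_device_path
# Read the contents of a MIG "access" file on stdin and print the matching
# /dev/nvidia-caps device path. Exits non-zero if no DeviceFileMinor line
# is found. Hypothetical helper; the file format is an assumption.
cap_device_path() {
  awk '$1 == "DeviceFileMinor:" {
         print "/dev/nvidia-caps/nvidia-cap" $2
         found = 1
       }
       END { exit !found }'
}

# Example with literal input, so it runs without a GPU:
printf 'DeviceFileMinor: 66\nDeviceFileMode: 292\nDeviceFileModify: 1\n' |
  cap_device_path

# On the node one might run (left commented out):
#   cap_device_path < /proc/driver/nvidia/capabilities/gpu0/mig/gi7/access
```

The minor number 66 in the example is chosen to match the first gres.conf entry above; the actual value on a given node has to be read from the file.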



[2024-01-19T14:31:13.701] debug:  gres/gpu: init: loaded
[2024-01-19T14:31:13.701] debug:  gpu/generic: init: init: GPU Generic plugin loaded
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_1g.10gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_1g.10gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_1g.10gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_1g.10gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_1g.10gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_1g.10gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_1g.10gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_1g.10gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_2g.20gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_2g.20gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_2g.20gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_3g.40gb Count=1
[2024-01-19T14:31:13.701] Gres Name=gpu Type=H100_3g.40gb Count=1
[2024-01-19T14:31:13.702] Gres Name=gpu Type=H100_7g.80gb Count=1

Best,
Drazen Jalsovec