[slurm-users] Slurm and MIG configuration help

Wed Apr 12 21:18:43 UTC 2023

Hi all! I've successfully configured Slurm on one head node and
two different compute nodes: one using "old" consumer RTX cards, and a new one
using 4x A100 GPUs (80 GB version).
I am now trying to set up a hybrid MIG configuration, where devices 0 and 1 are
kept as is, while devices 2 and 3 are each split into 3g.40gb MIG instances.

MIG itself works well: I am able to keep MIG disabled on devices 0 and 1, and
enabled on devices 2 and 3 with two 40gb instances each.
Configuring Slurm for this has me lost: I have tried countless
variations, but not a single one has worked so far.
Here's what I have at the moment:

- My gres.conf has gone from the full list to literally just
"AutoDetect=nvml", and slurmd -G returns somewhat reasonable output:

slurmd: gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system device(s) detected
slurmd: Gres Name=gpu Type=a100 Count=1 Index=0 ID=7696487
File=/dev/nvidia0 Cores=24-31 CoreCnt=128 Links=-1,4,0,0
Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=a100_3g.39gb Count=1 Index=283 ID=7696487
File=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap282,/dev/nvidia-caps/nvidia-cap283
Cores=56-63 CoreCnt=128 Links=-1,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=a100_3g.39gb Count=1 Index=418 ID=7696487
File=/dev/nvidia3,/dev/nvidia-caps/nvidia-cap417,/dev/nvidia-caps/nvidia-cap418
Cores=40-47 CoreCnt=128 Links=-1,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=a100 Count=1 Index=1 ID=7696487
File=/dev/nvidia1 Cores=8-15 CoreCnt=128 Links=4,-1,0,0
Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=a100_3g.39gb Count=1 Index=292 ID=7696487
File=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap291,/dev/nvidia-caps/nvidia-cap292
Cores=56-63 CoreCnt=128 Links=0,-1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=a100_3g.39gb Count=1 Index=427 ID=7696487
File=/dev/nvidia3,/dev/nvidia-caps/nvidia-cap426,/dev/nvidia-caps/nvidia-cap427
Cores=40-47 CoreCnt=128 Links=0,-1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML

And here is my first question: *the MIG profile is supposed to be called
3g.40gb, so why is it showing up as 3g.39gb?*

- My slurm.conf is very similar to the documentation example, with:
Gres=gpu:a100:2,gpu:a100_3g.39gb:4
- I restarted *slurmctld* and *slurmd* on the node, and everything appears to
be working.
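
For what it's worth, the Gres= value above can be cross-checked against the
slurmd -G output with a short script. This is only a minimal sketch: the
function name gres_spec and the regular expression are my own and not part of
Slurm, and the sample input keeps just the Name/Type/Count/Index fields of each
entry from the output above.

```python
import re
from collections import Counter

# Name/Type/Count/Index fields copied from the slurmd -G output above;
# the File=, Cores=, Links= and Flags= fields are dropped for brevity.
SAMPLE = """\
slurmd: Gres Name=gpu Type=a100 Count=1 Index=0
slurmd: Gres Name=gpu Type=a100_3g.39gb Count=1 Index=283
slurmd: Gres Name=gpu Type=a100_3g.39gb Count=1 Index=418
slurmd: Gres Name=gpu Type=a100 Count=1 Index=1
slurmd: Gres Name=gpu Type=a100_3g.39gb Count=1 Index=292
slurmd: Gres Name=gpu Type=a100_3g.39gb Count=1 Index=427
"""

# Hypothetical pattern: matches the "Gres Name=... Type=... Count=..." prefix
# of each detected device line and ignores everything else.
GRES_LINE = re.compile(r"Gres Name=(\S+) Type=(\S+) Count=(\d+)")


def gres_spec(slurmd_output: str) -> str:
    """Sum the Count of each (Name, Type) pair and format it like a Gres= value."""
    totals = Counter()
    for name, gres_type, count in GRES_LINE.findall(slurmd_output):
        totals[(name, gres_type)] += int(count)
    return ",".join(f"{n}:{t}:{c}" for (n, t), c in sorted(totals.items()))


if __name__ == "__main__":
    print("Gres=" + gres_spec(SAMPLE))
    # prints: Gres=gpu:a100:2,gpu:a100_3g.39gb:4
```

The result matches the Gres= string in my slurm.conf, so the type names and
counts at least agree with what slurmd detects.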

When I try to run an *srun* command, odd things happen:
- srun --gres=gpu:a100:2 returns a non-MIG device AND a MIG device together
- sinfo either shows only the 2 a100 GPUs ("*gpu:a100:2(S:1)*"), or the node
reports gpu count too low (0 < 4) for the MIG devices and stays in drain state
- the fully qualified name "gpu:a100_3g.39gb:1" returns "Unable to
allocate resources: Requested node configuration is not available".
*Where do I start to fix this mess?*

Thank you for your patience!
Cheers,

Edoardo