[slurm-users] addressing NVIDIA MIG + non MIG devices in Slurm - within one node

Thu Jan 27 19:22:35 UTC 2022

I should have stated more prominently that my problem is about 
addressing NVIDIA MIG and non-MIG devices in Slurm WITHIN ONE NODE 
(although this should be clear from reading my report). That leaves me 
with the last-resort solution of having one node with only MIG devices 
and one node with no MIG devices. But that has downsides, such as no 
multi-node MPI jobs, and in general I still can't believe there is such 
a limitation.

Thanks again for any feedback
Matthias

On 27.01.22 at 16:27, Matthias Leopold wrote:
> Hi,
> 
> we have 2 DGX A100 systems which we would like to use with Slurm. We 
> want to use the MIG feature for _some_ of the GPUs. As I had somewhat 
> suspected, I haven't found a working setup for this in Slurm yet. Below 
> I describe the configuration variants I tried after creating the MIG 
> instances. It is a longer read, so please bear with me.
> 
> 1. using slurm-mig-discovery for gres.conf 
> (https://gitlab.com/nvidia/hpc/slurm-mig-discovery)
> - CUDA_VISIBLE_DEVICES: list of indices
> -> at first this looks like a working setup with full flexibility, but 
> on closer inspection the selection of GPU devices is completely 
> unpredictable (judging by the output of nvidia-smi inside a Slurm job)
> 
> 2. using "AutoDetect=nvml" in gres.conf (Slurm docs)
> - CUDA_VISIBLE_DEVICES: MIG format (see 
> https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars)
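
To check which of the two formats a job actually received, something like
the following could be run inside the job. This is only a sketch: the
function name classify_visible_devices is made up for illustration, and the
token shapes it recognizes (plain index, "GPU-..." UUID, "MIG-..."
identifier) are assumptions based on the NVIDIA documentation linked above.

```python
import os


def classify_visible_devices(value):
    """Return a list of (kind, token) pairs for a CUDA_VISIBLE_DEVICES string.

    Assumed token shapes (not taken from Slurm itself):
      "0"         -> plain device index
      "GPU-..."   -> full GPU addressed by UUID
      "MIG-..."   -> MIG instance identifier
    """
    result = []
    for token in (t.strip() for t in value.split(",")):
        if not token:
            continue  # skip empty entries, e.g. from an unset variable
        if token.isdigit():
            kind = "index"
        elif token.startswith("MIG-"):
            kind = "mig"
        elif token.startswith("GPU-"):
            kind = "gpu-uuid"
        else:
            kind = "unknown"
        result.append((kind, token))
    return result


if __name__ == "__main__":
    # Print one line per device that the job was given.
    value = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    for kind, token in classify_visible_devices(value):
        print(f"{kind:9s} {token}")
```

Comparing this output with nvidia-smi inside the same job shows whether the
variable holds indices (variant 1) or MIG identifiers (variant 2).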
> 
> 2.1 converting ALL GPUs to MIG
> - also a full A100 is converted to a 7g.40gb MIG instance
> - gres.conf: "AutoDetect=nvml" only
> - slurm.conf Node Def: naming all MIG types (read from slurmd debug log)
> -> working setup
> -> problem: IPC (MPI) between MIG instances not possible, this seems to 
> be a by-design limitation
> 
> 2.2 converting SOME GPUs to MIG
> - some A100 are NOT in MIG mode
> 
> 2.2.1 using "AutoDetect=nvml" only (Variant 1)
> - slurm.conf Node Def: Gres with and without type
> -> problem: fatal: _foreach_slurm_conf: Some gpu GRES in slurm.conf have 
> a type while others do not (slurm_gres->gres_cnt_config (26) > tmp_count 
> (21))
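
The mixture that triggers this error can be spotted before starting slurmd.
A minimal sketch, assuming the Gres value uses the usual gpu[:type]:count
form. The helper names and the example value are hypothetical; the counts
in the example are chosen only to match the 21 and 26 in the message above.

```python
def split_gpu_gres(gres):
    """Split a slurm.conf-style Gres value into typed and untyped gpu counts.

    Returns (typed, untyped), where typed maps a type name to its count and
    untyped is the total count of gpu entries without a type.
    """
    typed = {}
    untyped = 0
    for entry in gres.split(","):
        parts = entry.strip().split(":")
        if parts[0] != "gpu":
            continue  # ignore non-gpu GRES and empty entries
        if len(parts) == 1:        # "gpu" alone, treated here as count 1
            untyped += 1
        elif len(parts) == 2:      # "gpu:<count>"
            untyped += int(parts[1])
        elif len(parts) == 3:      # "gpu:<type>:<count>"
            typed[parts[1]] = typed.get(parts[1], 0) + int(parts[2])
    return typed, untyped


def mixes_typed_and_untyped(gres):
    """True if the Gres value has both typed and untyped gpu entries."""
    typed, untyped = split_gpu_gres(gres)
    return bool(typed) and untyped > 0
```

With the hypothetical value "gpu:1g.5gb:21,gpu:5", mixes_typed_and_untyped
returns True, which is the situation the fatal error complains about; a
value with only typed or only untyped entries returns False.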
> 
> 2.2.2 using "AutoDetect=nvml" only (Variant 2)
> - slurm.conf Node Def: only Gres without type (sum of MIG + non MIG)
> -> problem: different GPU types can't be requested
> 
> 2.2.3 using partial "AutoDetect=nvml"
> - gres.conf: "AutoDetect=nvml" + hardcoding of non MIG GPUs
> - slurm.conf Node Def: MIG + non MIG Gres types
> -> produces a "perfect" config according to slurmd debug log
> -> problem: the sanity-check mode of "AutoDetect=nvml" prevents 
> operation (?)
> -> Reason=gres/gpu:1g.5gb count too low (0 < 21) 
> [slurm@2022-01-27T11:23:59]
> 
> 2.2.4 using static gres.conf with NVML generated config
> - using a gres.conf with NVML generated config where I can define the 
> type for non MIG GPU and also set the UniqueId for MIG instances would 
> be the perfect solution
> - slurm.conf Node Def: MIG + non MIG Gres types
> -> problem: it doesn't work
> -> Parsing error at unrecognized key: UniqueId
> 
> Thanks for reading this far. Am I missing something? How can MIG and non 
> MIG devices be addressed in a cluster? This setup of having MIG and non 
> MIG devices can't be exotic, since having ONLY MIG devices has severe 
> disadvantages (see 2.1). Thanks again for any advice.
> 
> Matthias
>