[slurm-users] addressing NVIDIA MIG + non MIG devices in Slurm

Bas van der Vlies bas.vandervlies at surf.nl
Mon Jan 31 15:05:44 UTC 2022


This is not an answer to the MIG issue but to the question that Esben
has. At SURF we have developed sharing of all the GPUs in a node by
"misusing" the Slurm MPS feature. We mostly use this for GPU courses,
e.g. via JupyterHub.

We have tested it with Slurm version 20.11.8. The code is publicly
available at:
  * https://github.com/basvandervlies/surf_slurm_mps
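
To give an idea of the approach, here is a minimal sketch of the kind of
configuration involved (node name, device paths and counts are
placeholders, not our production values; the repository above contains
the actual implementation):

    # gres.conf: one gpu gres per device, plus an mps gres bound to each device
    Name=gpu Type=a100 File=/dev/nvidia0
    Name=gpu Type=a100 File=/dev/nvidia1
    Name=mps Count=100 File=/dev/nvidia0
    Name=mps Count=100 File=/dev/nvidia1

    # slurm.conf node definition
    NodeName=gpu01 Gres=gpu:a100:2,mps:200

A job (e.g. a JupyterHub-spawned course session) then requests a share
of a GPU instead of a whole device:

    srun --gres=mps:25 --pty bash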

On 27/01/2022 17:00, EPF (Esben Peter Friis) wrote:
> Hi Mathias
> 
> I can't answer your specific question, so this is more of a comment 🙂
> 
> We have a system with 8 x Nvidia A40, where we would like to share each 
> GPU between several jobs (they have 48GB each), e.g. starting 32 jobs, 
> with 4 on each GPU. I looked into MIG as well, but unfortunately that is 
> not supported by the A40 hardware (only A30 and A100).
> I have tried MPS, but strangely it works only for the first GPU on 
> each node, so only one of the 8 GPUs in our system can be shared in this 
> way. At least, that used to be the case: a couple of weeks ago, an 
> "all_sharing" flag was introduced for gres.conf, which apparently should 
> make it possible to share all the GPUs with MPS. I haven't tried it yet, 
> but it may be worth a try. It should be possible to configure some GPUs 
> as mps and some as gpu resources.
> 
> Cheers,
> 
> Esben
> 
> 
> https://github.com/SchedMD/slurm/blob/master/doc/man/man5/gres.conf.5
> 
> all_sharing
> 	To be used on a shared gres. This is the opposite of one_sharing
> 	and can be used to allow all sharing gres (gpu) on a node to be
> 	used for shared gres (mps).
> 
> 	NOTE: If a gres has this flag configured it is global, so all
> 	other nodes with that gres will have this flag implied. This flag
> 	is not compatible with one_sharing for a specific gres.
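
For what it's worth, my reading of that entry is that a gres.conf along
these lines should then allow all GPUs of a node to be shared via MPS
(node name, device path and counts below are only illustrative, and I
have not tested the new flag myself):

    # gres.conf (sketch, untested)
    Name=gpu Type=a40 File=/dev/nvidia[0-7]
    Name=mps Count=800 Flags=all_sharing

    # slurm.conf node definition (sketch)
    NodeName=gpu01 Gres=gpu:a40:8,mps:800

A job should then be able to land on any of the eight GPUs with e.g.
"srun --gres=mps:100".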
> 
> 
> 
> 
> ------------------------------------------------------------------------
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of 
> Matthias Leopold <matthias.leopold at meduniwien.ac.at>
> *Sent:* Thursday, January 27, 2022 16:27
> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
> *Subject:* [slurm-users] addressing NVIDIA MIG + non MIG devices in Slurm
> Hi,
> 
> we have 2 DGX A100 systems which we would like to use with Slurm. We
> want to use the MIG feature for _some_ of the GPUs. As I somehow
> suspected I couldn't find a working setup for this in Slurm yet. I'll
> describe the configuration variants I tried after creating the MIG
> instances, it might be a longer read, please bear with me.
> 
> 1. using slurm-mig-discovery for gres.conf
> (https://gitlab.com/nvidia/hpc/slurm-mig-discovery)
> - CUDA_VISIBLE_DEVICES: list of indices
> -> seems to bring a working setup and full flexibility at first, but
> when taking a closer look the selection of GPU devices is completely
> unpredictable (output of nvidia-smi inside Slurm job)
> 
> 2. using "AutoDetect=nvml" in gres.conf (Slurm docs)
> - CUDA_VISIBLE_DEVICES: MIG format (see
> https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars)
> 
> 2.1 converting ALL GPUs to MIG
> - also a full A100 is converted to a 7g.40gb MIG instance
> - gres.conf: "AutoDetect=nvml" only
> - slurm.conf Node Def: naming all MIG types (read from slurmd debug log)
> -> working setup
> -> problem: IPC (MPI) between MIG instances not possible, this seems to
> be a by-design limitation
> 
> 2.2 converting SOME GPUs to MIG
> - some A100 are NOT in MIG mode
> 
> 2.2.1 using "AutoDetect=nvml" only (Variant 1)
> - slurm.conf Node Def: Gres with and without type
> -> problem: fatal: _foreach_slurm_conf: Some gpu GRES in slurm.conf have
> a type while others do not (slurm_gres->gres_cnt_config (26) > tmp_count
> (21))
> 
> 2.2.2 using "AutoDetect=nvml" only (Variant 2)
> - slurm.conf Node Def: only Gres without type (sum of MIG + non MIG)
> -> problem: different GPU types can't be requested
> 
> 2.2.3 using partial "AutoDetect=nvml"
> - gres.conf: "AutoDetect=nvml" + hardcoding of non MIG GPUs
> - slurm.conf Node Def: MIG + non MIG Gres types
> -> produces a "perfect" config according to slurmd debug log
> -> problem: the sanity-check mode of "AutoDetect=nvml" prevents
> operation (?)
> -> Reason=gres/gpu:1g.5gb count too low (0 < 21) [slurm@2022-01-27T11:23:59]
> 
> 2.2.4 using static gres.conf with NVML generated config
> - using a gres.conf with NVML generated config where I can define the
> type for non MIG GPU and also set the UniqueId for MIG instances would
> be the perfect solution
> - slurm.conf Node Def: MIG + non MIG Gres types
> -> problem: it doesn't work
> -> Parsing error at unrecognized key: UniqueId
> 
> Thanks for reading this far. Am I missing something? How can MIG and non
> MIG devices be addressed in a cluster? This setup of having MIG and non
> MIG devices can't be exotic, since having ONLY MIG devices has severe
> disadvantages (see 2.1). Thanks again for any advice.
> 
> Matthias
> 

-- 
Bas van der Vlies
| HPCV Supercomputing | Internal Services  | SURF | 
https://userinfo.surfsara.nl |
| Science Park 140 | 1098 XG Amsterdam | Phone: +31208001300 |
|  bas.vandervlies at surf.nl