[slurm-users] Mixing GPU Types on Same Node

Wed Mar 29 18:29:21 UTC 2023

Hello,

Apologies if this is in the docs but I couldn't find it anywhere. 

I've been using Slurm to run a small 7-node cluster in a research lab for a
couple of years now (I'm a PhD student). A couple of our nodes have
heterogeneous GPU models. One in particular has quite a few: 2x NVIDIA A100s,
1x NVIDIA 3090, 2x NVIDIA GV100 w/ NVLink, 1x AMD MI100, 2x AMD MI200. This
makes things a bit challenging, but I need to work with what I have.

1.	I've only been able to set this up previously on Slurm 20.02 by
"ignoring" the AMDs and just specifying the NVIDIA GPUs. That worked when we
had one or two people using the AMD GPUs and they could coordinate between
themselves. But now we have more people interested. I'm upgrading Slurm to
23.02 in hopes that it might fix some of the challenges, but is mixing NVIDIA
and AMD GPUs on the same node supposed to be possible at all? Ideally I would
like to have AutoDetect=nvml and AutoDetect=rsmi both on. If it's not
possible, I'll shuffle GPUs around to make this node NVIDIA-only.
2.	I want everyone to allocate GPUs with --gpus=<type>:<num> instead of
--gpus=<num>, so they don't "block" a nice GPU like an A100 when all they
really wanted was any old GPU on the machine, like a GV100 or 3090. Can I
force people to specify a GPU type and not just a count? This is especially
important if I'm mixing AMDs and NVIDIAs on the same node. If not, can I
specify the "order" in which GPUs are scheduled when no type is given (so
they get handed out from least powerful to most powerful if people don't
care)?
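
To make item 2 concrete, here is a rough Python sketch of the check I have in
mind. It is only an illustration: the function name is hypothetical, it is not
part of Slurm, and it assumes a --gpus value is a comma-separated list of
entries written either as <num> or as <type>:<num>.

```python
def gpus_spec_has_type(spec: str) -> bool:
    """Return True only if every entry in a --gpus value names a GPU type.

    Hypothetical helper, not a Slurm API. Assumes the value is a
    comma-separated list of "<num>" or "<type>:<num>" entries.
    """
    for entry in spec.split(","):
        # Split on the last colon: the type is on the left, the count on the right.
        gpu_type, sep, count = entry.strip().rpartition(":")
        if not sep or not gpu_type or not count.isdigit():
            return False
    return True


if __name__ == "__main__":
    # Type names here are placeholders; real ones depend on the site's GRES config.
    for value in ("a100:1", "2", "a100:1,gv100:2"):
        verdict = "accept" if gpus_spec_has_type(value) else "reject"
        print(f"--gpus={value}: {verdict}")
```

With this, a request like --gpus=a100:1 would pass, while a bare --gpus=2
would be rejected because it names no type.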

Any help and/or advice here is much appreciated. Slurm has been amazing for
our lab (albeit challenging to set up at first) and I want to get everything
dialed in before I graduate :D

Thanks,
-Collin