[slurm-users] Mixing GPU Types on Same Node
irush at cs.huji.ac.il
Sun Apr 2 12:32:05 UTC 2023
With regard to 2: if you're using AccountingStorageTRES, I think you can
specify each gres/gpu:<type> to be tracked in addition to the generic
gres/gpu, and then set "GrpTRES=gres/gpu=0" on all accounts so that they
won't be able to use gres/gpu, only gres/gpu:<type>.
We haven't tried this, but it's been on our todo list for a while now. So
I'd like to know if it works :)
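For reference, a rough and untested sketch of what that idea might look like.
The GPU type names and the account name "mylab" are placeholders I made up;
the type names would have to match whatever Type= values the node's GPUs are
configured with.

```
# slurm.conf (sketch; type names are placeholders and must match the
# Type= values defined for the node's GPUs)
AccountingStorageTRES=gres/gpu,gres/gpu:a100,gres/gpu:rtx3090,gres/gpu:gv100,gres/gpu:mi100,gres/gpu:mi200
# Association limits such as GrpTRES are only enforced when "limits" is set
AccountingStorageEnforce=limits

# Then, per account ("mylab" is a hypothetical account name):
#   sacctmgr modify account where name=mylab set GrpTRES=gres/gpu=0
```

It would be worth checking on a test account first whether a typed request
such as --gpus=a100:1 is still counted against the generic gres/gpu limit,
since in that case the limit would block typed jobs as well.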
On Wed, 29 Mar 2023 at 21:31, <collin.m.mccarthy at gmail.com> wrote:
> Apologies if this is in the docs but I couldn’t find it anywhere.
> I’ve been using Slurm to run a small 7-node cluster in a research lab for
> a couple of years now (I’m a PhD student). A couple of our nodes have
> heterogeneous GPU models. One in particular has quite a few: 2x NVIDIA
> A100s, 1x NVIDIA 3090, 2x NVIDIA GV100 w/ NVLink, 1x AMD MI100, 2x AMD
> MI200. This makes things a bit challenging but I need to work with what I
> have.
> 1. I’ve only been able to set this up previously on Slurm 20.02 by
> “ignoring” the AMDs and just specifying the NVIDIA GPUs. That worked when
> we had one or two people using the AMD GPUs and they could coordinate
> between themselves. But now, we have more people interested. I’m upgrading
> Slurm to 23.02 in hopes that might fix some of the challenges, but
> should this be possible? Ideally I would like to have AutoDetect=nvml
> and AutoDetect=rsmi both on. If it’s not I’ll shuffle GPUs around to
> make this node NVIDIA-only.
> 2. I want everyone to allocate GPUs with --gpus=<type>:<num> instead
> of --gpus=<num>, so they don’t “block” a nice GPU like an A100 when
> they really wanted any-old GPU on the machine like a GV100 or 3090. Can I
> force people to specify a GPU type and not just a count? This is especially
> important if I’m mixing AMDs and NVIDIAs on the same node. If not, can I
> specify the “order” in which I want GPUs to be scheduled if they don’t
> specify a type (so they get handed out from least-powerful to most-powerful
> if people don’t care)?
> Any help and/or advice here is much appreciated. Slurm has been amazing
> for our lab (albeit challenging to set up at first) and I want to get
> everything dialed before I graduate :D .
\/ | Yair Yarom | System Group (DevOps)
 | The Rachel and Selim Benin School
 /\ | of Computer Science and Engineering
//\\/ | The Hebrew University of Jerusalem
[// \\ | T +972-2-5494522 | F +972-2-5494522
// \ | irush at cs.huji.ac.il