[slurm-users] Mixing GPU Types on Same Node

Yair Yarom irush at cs.huji.ac.il
Sun Apr 2 12:32:05 UTC 2023


With regard to 2: if you're using AccountingStorageTRES, I think you can
specify each gres/gpu:<type> to be tracked in addition to the generic
gres/gpu, and then set "GrpTRES=gres/gpu=0" on all accounts so they won't
be able to use the generic gres/gpu, only gres/gpu:<type>.

We haven't tried this, but it's been on our todo list for a while now. So
I'd like to know if it works :)
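
Roughly what I have in mind, as an untested sketch. The type names (a100,
gv100) and the account name (lab) are placeholders, not taken from any real
setup. The types would have to match the Type= values in gres.conf, and I'm
assuming limits are enforced at all (AccountingStorageEnforce includes
"limits"). Whether a zero limit on the generic gres/gpu really leaves the
typed requests usable is exactly the part we haven't verified.

```
# slurm.conf: track each typed GPU in addition to the generic gres/gpu
# (a100 and gv100 are placeholder type names)
AccountingStorageTRES=gres/gpu,gres/gpu:a100,gres/gpu:gv100

# Then, for each account ("lab" is a placeholder account name):
sacctmgr modify account where name=lab set GrpTRES=gres/gpu=0
```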

On Wed, 29 Mar 2023 at 21:31, <collin.m.mccarthy at gmail.com> wrote:

> Hello,
> Apologies if this is in the docs but I couldn’t find it anywhere.
> I’ve been using Slurm to run a small 7-node cluster in a research lab for
> a couple of years now (I’m a PhD student). A couple of our nodes have
> heterogeneous GPU models. One in particular has quite a few: 2x NVIDIA
> A100s, 1x NVIDIA 3090, 2x NVIDIA GV100 w/ NVLink, 1x AMD MI100, 2x AMD
> MI200. This makes things a bit challenging but I need to work with what I
> have.
>    1. I’ve only been able to set this up previously on Slurm 20.02 by
>    “ignoring” the AMDs and just specifying the NVIDIA GPUs. That worked when
>    we had one or two people using the AMD GPUs and they could coordinate
>    between themselves. But now, we have more people interested. I’m upgrading
>    Slurm to 23.02 in hopes that might fix some of the challenges, but
>    should this be possible? Ideally I would like to have AutoDetect=nvml
>    and AutoDetect=rsmi both on. If it’s not, I’ll shuffle GPUs around to
>    make this node NVIDIA-only.
>    2. I want everyone to allocate GPUs with --gpus=<type>:<num> instead
>    of --gpus=<num>, so they don’t “block” a nice GPU like an A100 when
>    they really wanted any-old GPU on the machine like a GV100 or 3090. Can I
>    force people to specify a GPU type and not just a count? This is especially
>    important if I’m mixing AMDs and NVIDIAs on the same node. If not, can I
>    specify the “order” in which I want GPUs to be scheduled if they don’t
>    specify a type (so they get handed out from least-powerful to most-powerful
>    if people don’t care)?
> Any help and/or advice here is much appreciated. Slurm has been amazing
> for our lab (albeit challenging to set up at first) and I want to get
> everything dialed in before I graduate :D .
> Thanks,
> -Collin


  /|       |
  \/       | Yair Yarom | System Group (DevOps)
  []       | The Rachel and Selim Benin School
  [] /\    | of Computer Science and Engineering
  []//\\/  | The Hebrew University of Jerusalem
  [//  \\  | T +972-2-5494522 | F +972-2-5494522
  //    \  | irush at cs.huji.ac.il
 //        |