[slurm-users] NVIDIA MIG question

Yair Yarom irush at cs.huji.ac.il
Wed Nov 16 08:48:38 UTC 2022


Hi,

From what we observed, Slurm sees each MIG as a distinct gres/gpu, so you
can have 14 jobs each using a different MIG.
However (unless something has changed in the past year), due to an NVIDIA
limitation, a single process can't access more than one MIG simultaneously
(this is unrelated to Slurm). So while a user can request a Slurm job with
2 GPUs (MIGs), they'll have to run two distinct processes within that job
in order to utilize both MIGs.
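
For example, a minimal sketch of such a job script (application name and
inputs are placeholders):

  #!/bin/bash
  #SBATCH --gres=gpu:2      # two MIG instances
  #SBATCH --ntasks=2

  # Launch two steps in parallel, one MIG each; a single process
  # would only ever be able to use one of the two MIGs.
  srun --ntasks=1 --gres=gpu:1 ./my_app input1 &
  srun --ntasks=1 --gres=gpu:1 ./my_app input2 &
  wait

(Depending on your Slurm version you may also need --exact on the srun
lines so the two steps can actually run concurrently.)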

HTH,


On Tue, 15 Nov 2022 at 23:42, Laurence <laurence.field at cern.ch> wrote:

> Hi Rob,
>
>
> Yes, those questions make sense. From what I understand, MIG should
> essentially split the GPU so that the instances behave as separate cards.
> Hence two different users should be able to use two different MIG instances
> at the same time, and a single job could use all 14 instances. The result
> you observed suggests that MIG is a feature of the driver, i.e. lspci shows
> one device but nvidia-smi shows seven devices.
>
>
> I haven't played around with this myself in slurm but would be interested
> to know the answers.
>
>
> Laurence
>
>
> On 15/11/2022 17:46, Groner, Rob wrote:
>
> We have successfully used the nvidia-smi tool to take the two A100s in a
> node and split them into multiple GPU devices. In one case, we split the
> two GPUs into 7 MIG devices each (14 total in that node), and in the other
> case, we split the two GPUs into 2 MIG devices each (4 total in the node).
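>
> For anyone following along, roughly the commands involved (a sketch, not
> our exact invocation; profile ID 19 is the 1g.5gb profile on an
> A100-40GB, adjust for your split):
>
>   # enable MIG mode on GPU 0 (repeat with -i 1 for the second card)
>   nvidia-smi -i 0 -mig 1
>   # list the available MIG profiles
>   nvidia-smi mig -lgip
>   # create seven 1g.5gb GPU instances plus their compute instances
>   nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19 -C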
>
> From our limited testing so far, and from the "sinfo" output, it appears
> that Slurm might be considering all of the MIG devices on the node to be
> on the same socket (even though the MIG devices come from two separate
> graphics cards in the node). The sinfo output shows (S:0) after the 14
> devices, indicating they're all on socket 0. That seems to be preventing
> two different users from using MIG devices at the same time. Am I wrong
> that having 14 MIG gres devices show up in Slurm should mean that, in
> theory, 14 different users could each use one at the same time?
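>
> If the autodetected affinity is wrong, one workaround (a sketch, not
> tested here) is to spell the gres out in gres.conf, using Cores= to tie
> each MIG to its card's socket; with Slurm 21.08+ a MIG's device files go
> in MultipleFiles= (the paths and core ranges below are placeholders):
>
>   # gres.conf -- one line per MIG instance
>   Name=gpu Type=1g.5gb Cores=0-23  MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-capNN
>   Name=gpu Type=1g.5gb Cores=24-47 MultipleFiles=/dev/nvidia1,/dev/nvidia-caps/nvidia-capMM
>
> Alternatively, AutoDetect=nvml in gres.conf lets Slurm pull the MIG
> layout and core affinity straight from the driver.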
>
> Even if that doesn't work: if I have 14 devices spread across 2 physical
> GPU cards, can one user utilize all 14 for a single job? I would hope that
> Slurm would treat each of the MIG devices as its own separate card, which
> would mean 14 different jobs could run at the same time, each using its
> own particular MIG, right?
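>
> (A quick way to check what Slurm actually registered, node name being a
> placeholder:
>
>   scontrol show node gpu01 | grep -i gres
>   sinfo -N -o "%N %G"
>
> which should report the gres/gpu count and types Slurm sees for the node.)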
>
> Do those questions make sense to anyone?  🙂
>
> Rob
>
>
>

-- 

  /|       |
  \/       | Yair Yarom | System Group (DevOps)
  []       | The Rachel and Selim Benin School
  [] /\    | of Computer Science and Engineering
  []//\\/  | The Hebrew University of Jerusalem
  [//  \\  | T +972-2-5494522 | F +972-2-5494522
  //    \  | irush at cs.huji.ac.il
 //        |