<div dir="ltr"><div>Hi,</div><div><br></div><div>From what we observed, Slurm sees each MIG as a distinct gres/gpu, so you can have 14 jobs each using a different MIG.<br></div><div>However (unless something has changed in the past year), due to NVIDIA limitations, a single process can't access more than one MIG simultaneously (this is unrelated to Slurm). So while a user can request a Slurm job with 2 GPUs (MIGs), they'll have to run two distinct processes within that job in order to use both MIGs.</div><div><br></div><div>HTH,</div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, 15 Nov 2022 at 23:42, Laurence &lt;<a href="mailto:laurence.field@cern.ch">laurence.field@cern.ch</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>Hi Rob, <br>
</p>
<p><br>
</p>
<p>Yes, those questions make sense. From what I understand, MIG
should essentially split the GPU into instances that behave as
separate cards. Hence two different users should be able to use two
different MIG instances at the same time, and a single job
could also use all 14 instances. The result you observed suggests that
MIG is a feature of the driver, i.e. lspci shows one device but
nvidia-smi shows 7 devices.<br>
</p>
<p><br>
</p>
<p>I haven't played around with this myself in Slurm but would be
interested to know the answers. <br>
</p>
<p><br>
</p>
<p>Laurence <br>
</p>
<p><br>
</p>
<div>On 15/11/2022 17:46, Groner, Rob wrote:<br>
</div>
<blockquote type="cite">
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0);background-color:rgb(255,255,255)">
We have successfully used the nvidia-smi tool to take the 2
A100s in a node and split them into multiple GPU devices. In
one case, we split the 2 GPUs into 7 MIG devices each, so 14 in
that node total, and in the other case, we split the 2 GPUs into
2 MIG devices each, so 4 total in the node.</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0);background-color:rgb(255,255,255)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0);background-color:rgb(255,255,255)">
From our limited testing so far, and from the "sinfo" output, it
appears that Slurm might be considering all of the MIG devices
on the node to be in the same socket (even though the MIG
devices come from two separate graphics cards in the node). The
sinfo output says (S:0) after the 14 devices are shown,
indicating they're in socket 0. That seems to be preventing 2
different users from using MIG devices at the same time. Am I
wrong that having 14 MIG gres devices show up in Slurm should
mean that, in theory, 14 different users could each use one at the
same time?</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0);background-color:rgb(255,255,255)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0);background-color:rgb(255,255,255)">
Even if that doesn't work, if I have 14 devices spread across
2 physical GPU cards, can one user utilize all 14 for a single
job? I would hope that Slurm would treat each of the MIG
devices as its own separate card, which would mean 14 different
jobs could run at the same time using their own particular MIG,
right?</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0);background-color:rgb(255,255,255)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0);background-color:rgb(255,255,255)">
Do those questions make sense to anyone? <span id="m_2940866027957411465🙂">🙂</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0);background-color:rgb(255,255,255)">
<span><br>
</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0);background-color:rgb(255,255,255)">
<span>Rob</span></div>
</blockquote>
</div>
</blockquote></div><br clear="all"><br>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr">
<div>
<pre style="font-family:monospace"> <span style="color:rgb(133,12,27)">/|</span> |
<span style="color:rgb(133,12,27)">\/</span> | <span style="color:rgb(51,88,104);font-weight:bold">Yair Yarom </span><span style="color:rgb(51,88,104)">| System Group (DevOps)</span>
<span style="color:rgb(92,181,149)">[]</span> | <span style="color:rgb(51,88,104);font-weight:bold">The Rachel and Selim Benin School</span>
<span style="color:rgb(92,181,149)">[]</span> <span style="color:rgb(133,12,27)">/\</span> | <span style="color:rgb(51,88,104);font-weight:bold">of Computer Science and Engineering</span>
<span style="color:rgb(92,181,149)">[]</span><span style="color:rgb(0,161,146)">//</span><span style="color:rgb(133,12,27)">\</span><span style="color:rgb(133,12,27)">\</span><span style="color:rgb(49,154,184)">/</span> | <span style="color:rgb(51,88,104)">The Hebrew University of Jerusalem</span>
<span style="color:rgb(92,181,149)">[</span><span style="color:rgb(1,84,76)">/</span><span style="color:rgb(0,161,146)">/</span> <span style="color:rgb(41,16,22)">\</span><span style="color:rgb(41,16,22)">\</span> | <span style="color:rgb(51,88,104)">T +972-2-5494522 | F +972-2-5494522</span>
<span style="color:rgb(1,84,76)">//</span> <span style="color:rgb(21,122,134)">\</span> | <span style="color:rgb(51,88,104)"><a href="mailto:irush@cs.huji.ac.il" target="_blank">irush@cs.huji.ac.il</a></span>
<span style="color:rgb(127,130,103)">/</span><span style="color:rgb(1,84,76)">/</span> |
</pre>
</div>
</div></div>
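<div>P.S. A minimal sketch of the "one process per MIG" approach described in the reply above, not a tested recipe. It assumes the job's devices are exported as a comma-separated CUDA_VISIBLE_DEVICES value; the helper names split_devices and launch_per_device and the worker command are hypothetical and only for illustration. Each child process is started with CUDA_VISIBLE_DEVICES narrowed to a single entry, so no process tries to use more than one MIG.</div>

```python
import os
import subprocess
import sys


def split_devices(visible):
    """Split a comma-separated CUDA_VISIBLE_DEVICES value into single entries."""
    return [entry.strip() for entry in visible.split(",") if entry.strip()]


def launch_per_device(command, devices):
    """Start one child process per device, each seeing exactly one device."""
    children = []
    for device in devices:
        env = dict(os.environ)
        env["CUDA_VISIBLE_DEVICES"] = device  # assumption: one MIG per process
        children.append(
            subprocess.Popen(command, env=env, stdout=subprocess.PIPE, text=True)
        )
    # Collect each child's exit status and output, in launch order.
    results = []
    for child in children:
        out, _ = child.communicate()
        results.append((child.returncode, out.strip()))
    return results


if __name__ == "__main__":
    # Hypothetical worker: it only reports which device it was given.
    worker = [sys.executable, "-c",
              "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
    devices = split_devices(os.environ.get("CUDA_VISIBLE_DEVICES", ""))
    for code, seen in launch_per_device(worker, devices):
        print(code, seen)
```

<div>In a real job the worker command would be replaced by the actual GPU program; the children run concurrently because they are all started before any output is collected.</div>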