[slurm-users] NVIDIA MIG question

Groner, Rob rug262 at psu.edu
Thu Nov 17 16:52:37 UTC 2022


The problem appears to be the use of AutoDetect=nvml in the gres.conf file.  When we remove that and fully specify every device instead (with help from the https://gitlab.com/nvidia/hpc/slurm-mig-discovery tool), I am able to submit jobs allocating all of the MIG GPUs at once, or submit X jobs each asking for just 1 GPU, without any of them going to pending until all GPUs are used up.
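
For anyone who hits the same thing: a hand-written gres.conf for this kind of layout (2x 3g.20gb per card) would look roughly like the sketch below.  The Type string has to match what slurm.conf advertises, and the device paths / nvidia-caps minor numbers are illustrative only; the discovery tool (or /proc/driver/nvidia-caps/mig-minors) gives you the real ones for your hardware.

# gres.conf (no AutoDetect): one line per MIG device; MultipleFiles lists
# the parent GPU device plus the MIG GI/CI cap devices
Name=gpu Type=nvidia_a100_3g.20gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap21,/dev/nvidia-caps/nvidia-cap22
Name=gpu Type=nvidia_a100_3g.20gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap30,/dev/nvidia-caps/nvidia-cap31
Name=gpu Type=nvidia_a100_3g.20gb MultipleFiles=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap156,/dev/nvidia-caps/nvidia-cap157
Name=gpu Type=nvidia_a100_3g.20gb MultipleFiles=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap165,/dev/nvidia-caps/nvidia-cap166

# matching node definition in slurm.conf
NodeName=t-gc-1201 Gres=gpu:nvidia_a100_3g.20gb:4 ...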

Rob


________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Groner, Rob <rug262 at psu.edu>
Sent: Thursday, November 17, 2022 10:08 AM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] NVIDIA MIG question

No, I can't submit more than 7 individual jobs and have them all run; the jobs after the first 7 go to pending until the first 7 finish.

And it's not a limit (at least, not a limit of 7), because here's the same problem on a node configured with 2x 3g.20gb per card (2 cards, so 4 MIG GPUs in the node total):

[rug262 at testsch (RC) slurm] sinfo -o "%20N  %10c  %10m  %25f  %40G "
NODELIST              CPUS        MEMORY      AVAIL_FEATURES             GRES
t-gc-1201             48          358400      3gc20gb                    gpu:nvidia_a100_3g.20gb:4(S:0)

So there are 4 of them on that node.


[rug262 at testsch (RC) slurm] sbatch --gpus=1 --cpus-per-task=2 --partition=debug --nodelist=t-gc-1201 --wrap="sleep 100"

I submit 3 jobs like the above, each asking for 1 GPU from that node.


[rug262 at testsch (RC) slurm] squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              5049     debug     wrap   rug262 PD       0:00      1 (Resources)
              5048     debug     wrap   rug262  R       0:09      1 t-gc-1201
              5047     debug     wrap   rug262  R       0:31      1 t-gc-1201

The first 2 run fine, but any after that go to pending, even though there should be 4 MIGs available according to the sinfo output.
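
For what it's worth, the way I've been checking is roughly this (5049 being the pending job above; the grep patterns are just a convenience):

scontrol show job 5049 | grep -i reason
scontrol show node t-gc-1201 | grep -i gres

The node output shows the configured gres/gpu count alongside what Slurm thinks is currently allocated.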

Rob



________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Yair Yarom <irush at cs.huji.ac.il>
Sent: Thursday, November 17, 2022 8:19 AM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] NVIDIA MIG question

Can you run more than 7 single-GPU jobs on the same node?
It could be that you've hit a different resource limit (e.g. memory or CPU), or some other limit on the account, partition, or QOS.

On our setup we limit jobs to 1 GPU per job (via a partition QOS); however, we can use up all of the MIGs with single-GPU jobs.
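
In case it's useful, that per-job limit is set roughly like this (the qos name and partition here are just placeholders for our real ones):

sacctmgr add qos gpu1
sacctmgr modify qos where name=gpu1 set MaxTRESPerJob=gres/gpu=1

# then in slurm.conf, attach the qos to the partition
PartitionName=somepartition QOS=gpu1 ...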


On Wed, 16 Nov 2022 at 23:48, Groner, Rob <rug262 at psu.edu> wrote:
That does help, thanks for the extra info.

If I have two separate GPU cards in the node, and I set up 7 MIGs on each card, for a total of 14 MIG "gpus" in the node... then SHOULD I be able to salloc requesting, say, 10 GPUs (7 from 1 card, 3 from the other)?  Because I can't.

I can request up to 7 just fine.  When I request more than that, it pulls in other nodes to try to satisfy the request, even though there are theoretically 14 on the one node.  When I ask for 8, it gives me 7 from t-gc-1202 and then 1 from t-gc-1201.  When I ask for 10, it fails because it can't give me 10 without using both cards in one node.


[rug262 at testsch ~ ]# sinfo -o "%20N  %10c  %10m  %25f  %50G "
NODELIST              CPUS        MEMORY      AVAIL_FEATURES             GRES
t-gc-1201             48          358400      3gc20gb                    gpu:nvidia_a100_3g.20gb:4(S:0)
t-gc-1202             48          358400      1gc5gb                     gpu:nvidia_a100_1g.5gb:14(S:0)


[rug262 at testsch (RC) ~] salloc --gpus=10 --account=1gc5gb --partition=sla-prio
salloc: Job allocation 5015 has been revoked.
salloc: error: Job submit/allocate failed: Requested node configuration is not available


Rob

________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Yair Yarom <irush at cs.huji.ac.il>
Sent: Wednesday, November 16, 2022 3:48 AM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] NVIDIA MIG question

Hi,

From what we observed, Slurm sees the MIGs each as a distinct gres/gpu. So you can have 14 jobs each using a different MIG.
However (unless something has changed in the past year), due to NVIDIA limitations a single process can't access more than one MIG instance simultaneously (this is unrelated to Slurm). So while a user can request a Slurm job with 2 GPUs (MIGs), they'll have to run two distinct processes within that job in order to utilize those two MIGs.
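
As a sketch of what I mean by two distinct processes, something like the batch script below (the application name is a placeholder, and the exact step flags depend on your Slurm version):

#!/bin/bash
#SBATCH --gpus=2
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=2

# each step is bound to one MIG instance; neither process can see both
srun --ntasks=1 --gpus=1 --exact ./my_gpu_app input1 &
srun --ntasks=1 --gpus=1 --exact ./my_gpu_app input2 &
wait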

HTH,


On Tue, 15 Nov 2022 at 23:42, Laurence <laurence.field at cern.ch> wrote:

Hi Rob,


Yes, those questions make sense. From what I understand, MIG should essentially split the GPU so that the instances behave as separate cards. Hence two different users should be able to use two different MIG instances at the same time, and a single job could also use all 14 instances. The result you observed suggests that MIG is a feature of the driver, i.e. lspci shows one device but nvidia-smi shows 7 devices.


I haven't played around with this myself in slurm but would be interested to know the answers.


Laurence


On 15/11/2022 17:46, Groner, Rob wrote:
We have successfully used the nvidia-smi tool to take the 2 A100s in a node and split them into multiple GPU devices.  In one case, we split the 2 GPUs into 7 MIG devices each, so 14 in that node total, and in the other case we split the 2 GPUs into 2 MIG devices each, so 4 total in the node.
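
For context, the general nvidia-smi recipe for that kind of split looks something like this (GPU indices and profile names are illustrative; nvidia-smi mig -lgip lists what the cards actually support):

# enable MIG mode on both cards (may need a GPU reset or reboot)
nvidia-smi -i 0 -mig 1
nvidia-smi -i 1 -mig 1

# create seven 1g.5gb GPU instances per card; -C also creates the
# matching compute instances
nvidia-smi mig -i 0 -cgi 1g.5gb,1g.5gb,1g.5gb,1g.5gb,1g.5gb,1g.5gb,1g.5gb -C
nvidia-smi mig -i 1 -cgi 1g.5gb,1g.5gb,1g.5gb,1g.5gb,1g.5gb,1g.5gb,1g.5gb -C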

From our limited testing so far, and from the sinfo output, it appears that Slurm might be considering all of the MIG devices on the node to be in the same socket (even though the MIG devices come from two separate graphics cards in the node).  The sinfo output shows (S:0) after the 14 devices, indicating they're all in socket 0.  That seems to be preventing 2 different users from using MIG devices at the same time.  Am I wrong that having 14 MIG gres devices show up in Slurm should mean that, in theory, 14 different users could each use one at the same time?

Even IF that doesn't work... if I have 14 devices spread across 2 physical GPU cards, can one user utilize all 14 for a single job?  I would hope that Slurm would treat each of the MIG devices as its own separate card, which would mean 14 different jobs could run at the same time, each using its own particular MIG, right?

Do those questions make sense to anyone?  🙂

Rob




--

  /|       |
  \/       | Yair Yarom | System Group (DevOps)
  []       | The Rachel and Selim Benin School
  [] /\    | of Computer Science and Engineering
  []//\\/  | The Hebrew University of Jerusalem
  [//  \\  | T +972-2-5494522 | F +972-2-5494522
  //    \  | irush at cs.huji.ac.il
 //        |



