[slurm-users] NVIDIA MIG question
Groner, Rob
rug262 at psu.edu
Thu Nov 17 15:08:26 UTC 2022
No, I can't submit more than 7 individual jobs and have them all run; the jobs after the first 7 go to pending until the first 7 finish.
And it's not a limit (at least, not of "7"), because here's the same problem on a node configured with 2x 3g.20gb per card (2 cards, so 4 MIG GPUs total in the node):
[rug262 at testsch (RC) slurm] sinfo -o "%20N %10c %10m %25f %40G "
NODELIST              CPUS        MEMORY      AVAIL_FEATURES             GRES
t-gc-1201             48          358400      3gc20gb                    gpu:nvidia_a100_3g.20gb:4(S:0)
So, there are 4 of them on that node
I submit 3 jobs, each asking for 1 GPU from that node:
[rug262 at testsch (RC) slurm] sbatch --gpus=1 --cpus-per-task=2 --partition=debug --nodelist=t-gc-1201 --wrap="sleep 100"
[rug262 at testsch (RC) slurm] squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  5049     debug     wrap   rug262 PD       0:00      1 (Resources)
  5048     debug     wrap   rug262  R       0:09      1 t-gc-1201
  5047     debug     wrap   rug262  R       0:31      1 t-gc-1201
The first 2 go fine, but any after that go to pending, even though there should be 4 available according to the sinfo output.
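(For anyone reproducing this, a minimal sketch of the test; the partition and node names are from my cluster, and the sleep is just a placeholder workload:)

    # submit one single-GPU job per MIG device on the node
    for i in 1 2 3 4; do
        sbatch --gpus=1 --cpus-per-task=2 --partition=debug \
               --nodelist=t-gc-1201 --wrap="sleep 100"
    done
    squeue -w t-gc-1201    # expected: 4 running; observed: 2 running, 2 pending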
Rob
________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Yair Yarom <irush at cs.huji.ac.il>
Sent: Thursday, November 17, 2022 8:19 AM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] NVIDIA MIG question
Can you request more than 7 single gpu jobs on the same node?
It could be that there's another limit you've encountered (e.g., memory or CPUs), or some other limit (in the account, partition, or QOS).
On our setup we limit jobs to 1 GPU per job (via a partition QOS); however, we can use up all the MIGs with single-GPU jobs.
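(If it's useful, a sketch of how such a limit can be set; the QOS and partition names here are placeholders, not necessarily what we run:)

    # create a QOS that caps every job at one GPU
    sacctmgr add qos gpu1
    sacctmgr modify qos gpu1 set MaxTRESPerJob=gres/gpu=1

    # then attach it to the partition in slurm.conf:
    #   PartitionName=gpu ... QOS=gpu1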
On Wed, 16 Nov 2022 at 23:48, Groner, Rob <rug262 at psu.edu<mailto:rug262 at psu.edu>> wrote:
That does help, thanks for the extra info.
If I have two separate GPU cards in the node, and I set up 7 MIGs on each card, for a total of 14 MIG "gpus" in the node... then SHOULD I be able to salloc requesting, say, 10 GPUs (7 from one card, 3 from the other)? Because I can't.
I can request up to 7 just fine. When I request more than that, it pulls in other nodes to satisfy the request, even though there are theoretically 14 on the one node. When I ask for 8, it gives me 7 from t-gc-1202 and then 1 from t-gc-1201. When I ask for 10, it fails because it can't give me 10 without using both cards in one node.
[rug262 at testsch ~ ]# sinfo -o "%20N %10c %10m %25f %50G "
NODELIST              CPUS        MEMORY      AVAIL_FEATURES             GRES
t-gc-1201             48          358400      3gc20gb                    gpu:nvidia_a100_3g.20gb:4(S:0)
t-gc-1202             48          358400      1gc5gb                     gpu:nvidia_a100_1g.5gb:14(S:0)
[rug262 at testsch (RC) ~] salloc --gpus=10 --account=1gc5gb --partition=sla-prio
salloc: Job allocation 5015 has been revoked.
salloc: error: Job submit/allocate failed: Requested node configuration is not available
Rob
________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com<mailto:slurm-users-bounces at lists.schedmd.com>> on behalf of Yair Yarom <irush at cs.huji.ac.il<mailto:irush at cs.huji.ac.il>>
Sent: Wednesday, November 16, 2022 3:48 AM
To: Slurm User Community List <slurm-users at lists.schedmd.com<mailto:slurm-users at lists.schedmd.com>>
Subject: Re: [slurm-users] NVIDIA MIG question
Hi,
From what we observed, Slurm sees each MIG as a distinct gres/gpu. So you can have 14 jobs, each using a different MIG.
However (unless something has changed in the past year), due to NVIDIA limitations, a single process can't access more than one MIG simultaneously (this is unrelated to Slurm). So while a user can request a Slurm job with 2 GPUs (MIGs), they'll have to run two distinct processes within that job in order to utilize both MIGs.
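(A minimal batch-script sketch of that pattern; ./my_app is a hypothetical single-GPU application:)

    #!/bin/bash
    #SBATCH --gpus=2
    # two tasks, one MIG each; each task sees only its own device
    # via its per-task CUDA_VISIBLE_DEVICES
    srun --ntasks=2 --gpus-per-task=1 ./my_app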
HTH,
On Tue, 15 Nov 2022 at 23:42, Laurence <laurence.field at cern.ch<mailto:laurence.field at cern.ch>> wrote:
Hi Rob,
Yes, those questions make sense. From what I understand, MIG should essentially split the GPU so that the instances behave as separate cards. Hence two different users should be able to use two different MIG instances at the same time, and a single job could also use all 14 instances. The result you observed suggests that MIG is a feature of the driver, i.e., lspci shows one device but nvidia-smi shows 7 devices.
I haven't played around with this myself in Slurm but would be interested to know the answers.
Laurence
On 15/11/2022 17:46, Groner, Rob wrote:
We have successfully used the nvidia-smi tool to take the 2 A100s in a node and split them into multiple GPU devices. In one case, we split the 2 GPUs into 7 MIG devices each, so 14 in that node total, and in the other case, we split the 2 GPUs into 2 MIG devices each, so 4 total in the node.
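(For reference, the split was done with nvidia-smi roughly along these lines; profile IDs 19 and 9 are the A100 1g.5gb and 3g.20gb profiles per NVIDIA's docs, so check nvidia-smi mig -lgip on your own hardware:)

    # enable MIG mode on both cards (may require a GPU reset)
    nvidia-smi -i 0 -mig 1
    nvidia-smi -i 1 -mig 1
    # 7x 1g.5gb GPU instances per card, creating compute instances too (-C)
    nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19 -C
    nvidia-smi mig -i 1 -cgi 19,19,19,19,19,19,19 -C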
From our limited testing so far, and from the sinfo output, it appears that Slurm might be considering all of the MIG devices on the node to be in the same socket (even though the MIG devices come from two separate graphics cards in the node). The sinfo output shows (S:0) after the 14 devices, indicating they're all in socket 0. That seems to be preventing 2 different users from using MIG devices at the same time. Am I wrong that having 14 MIG gres devices show up in Slurm should mean that, in theory, 14 different users could each use one at the same time?
Even IF that doesn't work... if I have 14 devices spread across 2 physical GPU cards, can one user utilize all 14 for a single job? I would hope that Slurm would treat each of the MIG devices as its own separate card, which would mean 14 different jobs could run at the same time, each using its own particular MIG, right?
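(For context, a sketch of one common way MIG devices get exposed to Slurm, via NVML autodetection; the node line below is illustrative, not necessarily our exact config:)

    # /etc/slurm/gres.conf -- let NVML enumerate the MIG devices
    AutoDetect=nvml

    # slurm.conf node definition (illustrative, matching the sinfo output above)
    #   NodeName=t-gc-1202 Gres=gpu:nvidia_a100_1g.5gb:14 ...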
Do those questions make sense to anyone? 🙂
Rob
--
/| |
\/ | Yair Yarom | System Group (DevOps)
[] | The Rachel and Selim Benin School
[] /\ | of Computer Science and Engineering
[]//\\/ | The Hebrew University of Jerusalem
[// \\ | T +972-2-5494522 | F +972-2-5494522
// \ | irush at cs.huji.ac.il<mailto:irush at cs.huji.ac.il>
// |