[slurm-users] Unconfigured GPUs being allocated

Wilson, Steven M stevew at purdue.edu
Fri Jul 14 20:10:32 UTC 2023

It's not so much whether a job may or may not access the GPU but rather which GPU(s) is(are) included in $CUDA_VISIBLE_DEVICES. That is what controls what our CUDA jobs can see and therefore use (within any cgroups constraints, of course). In my case, Slurm is sometimes setting $CUDA_VISIBLE_DEVICES to a GPU that is not in the Slurm configuration because it is intended only for driving the display and not GPU computations.

Thanks for your thoughts!

From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Christopher Samuel <chris at csamuel.org>
Sent: Friday, July 14, 2023 1:57 PM
To: slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Unconfigured GPUs being allocated

[You don't often get email from chris at csamuel.org. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]

---- External Email: Use caution with attachments, links, or sharing data ----

On 7/14/23 10:20 am, Wilson, Steven M wrote:

> I upgraded Slurm to 23.02.3 but I'm still running into the same problem.
> Unconfigured GPUs (those absent from gres.conf and slurm.conf) are still
> being made available to jobs so we end up with compute jobs being run on
> GPUs which should only be used

I think this is expected - it's not that Slurm is making them available,
it's that it's unaware of them and so doesn't control them in the way it
does for the GPUs it does know about. So you get the default behaviour
(any process can access them).

If you want to stop them being accessed from Slurm you'd need to find a
way to prevent that access via cgroups games or similar.

All the best,
Chris Samuel  :  https://nam04.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.csamuel.org%2F&data=05%7C01%7Cstevew%40purdue.edu%7C6fba97485b73413521d208db8494160a%7C4130bd397c53419cb1e58758d6d63f21%7C0%7C0%7C638249543794377751%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=VslW51ree1Ibt3xfYyy99Aj%2BREZh7BqpM6Ipg3jAM84%3D&reserved=0<http://www.csamuel.org/>  :  Berkeley, CA, USA

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20230714/1f3a1001/attachment-0001.htm>

More information about the slurm-users mailing list