[slurm-users] Unconfigured GPUs being allocated

Wilson, Steven M stevew at purdue.edu
Fri Jul 14 17:20:39 UTC 2023


I upgraded Slurm to 23.02.3 but I'm still running into the same problem. Unconfigured GPUs (those absent from gres.conf and slurm.conf) are still being made available to jobs, so we end up with compute jobs running on GPUs that should be reserved for driving the display.
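The quickest way I know to reproduce it is an interactive one-liner along these lines (a sketch; oryx is the node from my original message below, and note that CUDA_VISIBLE_DEVICES is only a hint to CUDA applications, not an enforced boundary):

    # request a single GPU and list every GPU the job can actually reach
    srun --gres=gpu:1 --nodelist=oryx \
        bash -c 'echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES; nvidia-smi -L'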

Any ideas?

Thanks,
Steve
________________________________
From: Wilson, Steven M
Sent: Tuesday, June 27, 2023 9:50 AM
To: slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
Subject: Unconfigured GPUs being allocated

Hi,

I manually configure the GPUs in our Slurm configuration (AutoDetect=off in gres.conf) and everything works fine when all the GPUs in a node are configured in gres.conf and available to Slurm.  But we have some nodes where a GPU is reserved for running the display and is deliberately left out of gres.conf.  On those nodes, Slurm nevertheless makes the unconfigured GPU available to jobs: a simple Slurm job that executes "nvidia-smi -L" (sketched below) lists the unconfigured GPU in addition to as many configured GPUs as the job requested.
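The test job is essentially the following (a minimal sketch; the --nodelist option is only there to pin the job to the node in question and isn't part of our normal workflow):

    #!/bin/bash
    #SBATCH --gres=gpu:1        # request a single GPU
    #SBATCH --nodelist=oryx     # pin to the node under test

    # list every GPU visible inside the allocation
    nvidia-smi -L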

For example, on a node configured with this line in slurm.conf:
    NodeName=oryx CoreSpecCount=2 CPUs=8 RealMemory=64000 Gres=gpu:RTX2080TI:1
and this line in gres.conf:
    NodeName=oryx Name=gpu Type=RTX2080TI File=/dev/nvidia1
I will get the following results from a job running "nvidia-smi -L" that requested a single GPU:
    GPU 0: NVIDIA GeForce GT 710 (UUID: GPU-21fe15f0-d8b9-b39e-8ada-8c1c8fba8a1e)
    GPU 1: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-0dc4da58-5026-6173-1156-c4559a268bf5)

But on another node that has all of its GPUs configured in Slurm, with this line in slurm.conf:
    NodeName=beluga CoreSpecCount=1 CPUs=16 RealMemory=128500 Gres=gpu:TITANX:2
and this line in gres.conf:
    NodeName=beluga Name=gpu Type=TITANX File=/dev/nvidia[0-1]
I get the expected results from the job running "nvidia-smi -L" that requested a single GPU:
    GPU 0: NVIDIA RTX A5500 (UUID: GPU-3754c069-799e-2027-9fbb-ff90e2e8e459)

I'm running Slurm 22.05.5.
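In case it's relevant: my understanding is that Slurm only enforces GPU visibility through cgroup device constraints, so something like the following in cgroup.conf (a sketch, assuming TaskPlugin in slurm.conf includes task/cgroup) should be what actually blocks access to devices that aren't in gres.conf:

    # cgroup.conf (sketch)
    ConstrainDevices=yes    # limit each job to the device files of its allocated GRES

With that in place I would expect a one-GPU job on oryx to be able to open only /dev/nvidia1, since /dev/nvidia0 isn't listed in gres.conf.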

Thanks in advance for any suggestions to help correct this problem!

Steve