Yes, we've seen the same thing with mosaic/heterogeneous partitions. Our solution is to split based on hardware type.
Having a bunch of partitions may seem unwieldy, but the scheduler can handle it. For instance, we have 110 partitions and the scheduler handles them fine (most of those are hardware owned by specific groups, not public partitions everyone can see). We've adopted the convention of naming our partitions after the hardware type: for instance, we have a gpu partition (our A100s) and a gpu_h200 partition, making it easy for people to identify the hardware. People who can use both will leverage multipartition submission a la #SBATCH -p gpu,gpu_h200.
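For a user with access to both, a minimal job script might look something like this (the resource numbers and program name are just placeholders, not our actual setup):

    #!/bin/bash
    # List both partitions; Slurm will start the job in whichever
    # one can provide the resources first.
    #SBATCH -p gpu,gpu_h200
    #SBATCH --gres=gpu:1
    #SBATCH -t 01:00:00
    #SBATCH --mem=16G

    srun ./my_gpu_program

That way the job isn't pinned to one GPU type and won't sit behind a queue for hardware it doesn't actually need.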
I don't know of a good solution if you want to keep the mosaic partition, as it really requires your users to think at a higher level and realize there is vacant hardware that could be used if they just selected a different GPU type. Having separate partitions makes it much easier to see.
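If you do go the per-type route, the slurm.conf side is just one partition per hardware flavor pointing at its own set of nodes; roughly something like this (node names and counts here are made up for illustration):

    # One partition per GPU type, each with its own nodes
    NodeName=gpu-a100-[01-10] Gres=gpu:a100:4
    NodeName=gpu-h200-[01-10] Gres=gpu:h200:4
    PartitionName=gpu      Nodes=gpu-a100-[01-10] State=UP
    PartitionName=gpu_h200 Nodes=gpu-h200-[01-10] State=UP

With that split, pending jobs for one GPU type can't hold back jobs that only want the other type, since each partition has its own queue.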
-Paul Edmon-
On 9/11/2025 3:23 PM, Kevin M. Hildebrand via slurm-users wrote:
We have several different types of GPUs in the same 'gpu' partition. The problem we're having occurs when one of those types of GPUs is fully occupied and there are a bunch of queued jobs waiting for those GPUs. If someone requests idle GPUs of a different type, those jobs end up getting stalled, even though there are plenty of GPUs available.
For example, say we have 10 A100 GPUs and 10 H100 GPUs. If there are 10 H100 GPU jobs running and more in queue waiting for them, subsequently submitted A100 jobs will sit in queue even if there are plenty of idle A100 GPUs. The only way we can get the A100 jobs to run is by manually bumping their priority higher than the pending H100 jobs.
Has anyone else encountered this issue? The only way we can think of to potentially solve it is to have separate partitions for each GPU type, but that seems unwieldy.
We are currently running Slurm 24.05.8.
Thanks, Kevin
--
Kevin Hildebrand
Director of Research Technology and HPC Services
Division of IT