We have several different types of GPUs in the same 'gpu' partition. The problem occurs when one of those GPU types is fully occupied and a number of queued jobs are waiting for it. Jobs submitted afterwards that request a different, idle GPU type also stay pending, even though plenty of GPUs of that type are available.

For example, say we have 10 A100 GPUs and 10 H100 GPUs. If 10 H100 jobs are running and more are in the queue waiting for H100s, subsequently submitted A100 jobs will sit in the queue even though there are plenty of idle A100 GPUs. The only way we can get the A100 jobs to run is by manually raising their priority above that of the pending H100 jobs.
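To make the scenario concrete, here is a toy Python model of the symptom. It is only an illustration of what we observe, not a description of how Slurm actually schedules; the function names, job names, and GPU counts are made up for the example. The first function walks the queue in priority order and stops at the first job that cannot start, which reproduces the stall. The second skips jobs that cannot start, which is the behavior we would like to see.

```python
from collections import Counter


def strict_order_pass(free, queue):
    """Start jobs in priority order; stop at the first job that cannot start."""
    free = Counter(free)
    started = []
    for name, gpu_type, count in queue:
        if free[gpu_type] < count:
            break  # the job at the head of the queue blocks everything behind it
        free[gpu_type] -= count
        started.append(name)
    return started


def skip_blocked_pass(free, queue):
    """Walk the same queue, but skip jobs that cannot start instead of stopping."""
    free = Counter(free)
    started = []
    for name, gpu_type, count in queue:
        if free[gpu_type] >= count:
            free[gpu_type] -= count
            started.append(name)
    return started


# Hypothetical state: all H100s busy, all A100s idle.
free_gpus = {"a100": 10, "h100": 0}

# Pending jobs in priority order: (name, GPU type, GPUs requested).
pending = [
    ("h100-job-1", "h100", 1),
    ("h100-job-2", "h100", 1),
    ("a100-job-1", "a100", 2),
]

print(strict_order_pass(free_gpus, pending))  # []
print(skip_blocked_pass(free_gpus, pending))  # ['a100-job-1']
```

In this model, moving "a100-job-1" to the front of the list lets it start under the strict pass too, which matches what we see when we bump the priority by hand.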

Has anyone else encountered this issue? The only way we can think of to potentially solve it is to have separate partitions for each GPU type, but that seems unwieldy.

We are currently running Slurm 24.05.8.

Thanks,

Kevin

Kevin Hildebrand

Director of Research Technology and HPC Services

Division of IT