[slurm-users] Re: Scheduling oddity with multiple GPU types in same partition

29 Oct 2024


On Friday, 25 October 2024 22:49:16 CET Kevin M. Hildebrand via slurm-users wrote:
...
We have a 'gpu' partition with 30 or so nodes, some with A100s, some with
H100s, and a few others.
It appears that when (for example) all of the A100 GPUs are in use, if
there are additional jobs requesting A100 GPUs pending, and those jobs have
the highest priority in the partition, then jobs submitted for H100s won't
run even if there are idle H100s.  This is a small subset of our present
pending queue; the four bottom jobs should be running, but aren't.  The top
pending job shows reason 'Resources' while the rest all show 'Priority'.
Any thoughts on why this might be happening?
JOBID     PRIORITY  TRES_ALLOC
8317749   501490    cpu=4,mem=80000M,node=1,billing=48,gres/gpu=1,gres/gpu:a100=1
8317750   501490    cpu=4,mem=80000M,node=1,billing=48,gres/gpu=1,gres/gpu:a100=1
8317745   501490    cpu=4,mem=80000M,node=1,billing=48,gres/gpu=1,gres/gpu:a100=1
8317746   501490    cpu=4,mem=80000M,node=1,billing=48,gres/gpu=1,gres/gpu:a100=1
8338679   500060    cpu=4,mem=64G,node=1,billing=144,gres/gpu=1,gres/gpu:h100=1
8338678   500060    cpu=4,mem=64G,node=1,billing=144,gres/gpu=1,gres/gpu:h100=1
8338677   500060    cpu=4,mem=64G,node=1,billing=144,gres/gpu=1,gres/gpu:h100=1
8338676   500060    cpu=4,mem=64G,node=1,billing=144,gres/gpu=1,gres/gpu:h100=1
Do you have Backfill Scheduling configured with bf_continue?
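For reference, a minimal sketch of what that would look like in slurm.conf.
It assumes no other SchedulerParameters are set on the cluster; any existing
options would have to stay in the same comma-separated list. The values
currently in effect can be checked with "scontrol show config".

```
# slurm.conf (illustrative fragment, not taken from the original poster's cluster)
SchedulerType=sched/backfill
# bf_continue: after the backfill scheduler releases its locks, keep working
# through the original job list instead of starting over from the top
SchedulerParameters=bf_continue
```

A change like this would normally be picked up with "scontrol reconfigure".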
regards
Markus Köberl
-- 
Markus Koeberl
Graz University of Technology
Signal Processing and Speech Communication Laboratory
E-mail: markus.koeberl@tugraz.at
