On Friday, 25 October 2024 22:49:16 CET Kevin M. Hildebrand via slurm-users wrote:
We have a 'gpu' partition with 30 or so nodes, some with A100s, some with H100s, and a few others. It appears that when, for example, all of the A100 GPUs are in use and additional jobs requesting A100 GPUs are pending with the highest priority in the partition, jobs submitted for H100s won't run even if there are idle H100s.

This is a small subset of our current pending queue: the four bottom jobs should be running, but aren't. The top pending job shows reason 'Resources' while the rest all show 'Priority'. Any thoughts on why this might be happening?
JOBID    PRIORITY  TRES_ALLOC
8317749  501490    cpu=4,mem=80000M,node=1,billing=48,gres/gpu=1,gres/gpu:a100=1
8317750  501490    cpu=4,mem=80000M,node=1,billing=48,gres/gpu=1,gres/gpu:a100=1
8317745  501490    cpu=4,mem=80000M,node=1,billing=48,gres/gpu=1,gres/gpu:a100=1
8317746  501490    cpu=4,mem=80000M,node=1,billing=48,gres/gpu=1,gres/gpu:a100=1
8338679  500060    cpu=4,mem=64G,node=1,billing=144,gres/gpu=1,gres/gpu:h100=1
8338678  500060    cpu=4,mem=64G,node=1,billing=144,gres/gpu=1,gres/gpu:h100=1
8338677  500060    cpu=4,mem=64G,node=1,billing=144,gres/gpu=1,gres/gpu:h100=1
8338676  500060    cpu=4,mem=64G,node=1,billing=144,gres/gpu=1,gres/gpu:h100=1
Do you have Backfill Scheduling configured with bf_continue?
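
For reference, a minimal sketch of the slurm.conf lines I mean, assuming the backfill plugin is what you run. The values are illustrative and not taken from your configuration; any other SchedulerParameters you already set would stay in the same comma-separated list.

```conf
# slurm.conf (sketch, illustrative values only)
# Use the backfill scheduler plugin.
SchedulerType=sched/backfill
# bf_continue: after the backfill scheduler releases its locks, keep
# working through its original job list instead of starting over.
SchedulerParameters=bf_continue
```

As far as I understand it, the main scheduler stops considering further jobs in a partition once one job is left pending, so the lower-priority H100 jobs would depend on the backfill pass to start. Without bf_continue that pass may restart from the top of the queue whenever job or node state changes, and on a busy system it might not reach them.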
Regards,
Markus Köberl