The former: jobs that should be able to run right now are not being started. We currently have these backfill parameters set: bf_continue,bf_max_job_user=10. bf_max_job_test is at the default of 500. However, sdiag says the number of times bf_max_job_test has been hit is zero, so that's probably not the limit we're running into. I can try removing bf_max_job_user, but I don't think that's the issue either, as the problem also affects users with only a few jobs in the queue when a different user has all of one GPU type in use.
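For reference, the relevant slurm.conf line currently looks roughly like this (other scheduler parameters omitted):

    SchedulerParameters=bf_continue,bf_max_job_user=10

and I've been checking the counters with something along these lines (grep patterns from memory, adjust as needed):

    scontrol show config | grep SchedulerParameters
    sdiag | grep -A 20 'Backfilling stats'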
Kevin
On Thu, Sep 11, 2025 at 3:38 PM Ryan Novosielski novosirj@rutgers.edu wrote:
Are you saying these are jobs that should be able to run right now but they’re just not getting considered, or there’s something that’s wrong about the way they’re submitted that has to be manually corrected to allow them to run on A100s?
If the former, it sounds like your backfill settings might just not be letting the scheduler consider jobs far enough down the queue.
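If that's the case, raising the backfill depth/window might be enough; something along these lines in slurm.conf, with the values purely illustrative and not a recommendation for your site:

    SchedulerParameters=bf_continue,bf_max_job_test=1000,bf_max_job_user=20,bf_window=2880

(bf_window is in minutes, so 2880 lets backfill plan roughly two days out.)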
--
#BlackLivesMatter
 ____
|| \UTGERS,      |---------------------------*O*---------------------------
||_// the State  | Ryan Novosielski (he/him) - novosirj@rutgers.edu
|| \ University  | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \ of NJ      | Office of Advanced Research Computing - MSB A555B, Newark
    `'
On Sep 11, 2025, at 15:23, Kevin M. Hildebrand via slurm-users <slurm-users@lists.schedmd.com> wrote:
We have several different types of GPUs in the same 'gpu' partition. The problem we're having occurs when one of those GPU types is fully occupied and there are a bunch of queued jobs waiting for it. If someone then requests GPUs of a different type, those new jobs end up stalled in the queue, even though there are plenty of idle GPUs of the type they asked for.
For example, say we have 10 A100 GPUs and 10 H100 GPUs. If there are 10 H100 GPU jobs running and more in queue waiting for them, subsequently submitted A100 jobs will sit in queue even if there are plenty of idle A100 GPUs. The only way we can get the A100 jobs to run is by manually bumping their priority higher than the pending H100 jobs.
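(For what it's worth, the manual bump is just something like the following, with the job ID and priority value obviously illustrative:

    scontrol update JobId=123456 Priority=1000000

and the jobs themselves are ordinary type-specific requests, e.g. --gres=gpu:a100:1, using the type names from our gres.conf.)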
Has anyone else encountered this issue? The only way we can think of to potentially solve it is to have separate partitions for each GPU type, but that seems unwieldy.
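(The separate-partition approach would presumably look something like this in slurm.conf, with the partition names and node lists illustrative:

    PartitionName=gpu-a100 Nodes=gpu[01-05] Default=NO
    PartitionName=gpu-h100 Nodes=gpu[06-10] Default=NO

but then every user has to pick the right partition for the GPU type they want, which is part of why it seems unwieldy.)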
We are currently running Slurm 24.05.8.
Thanks, Kevin
--
Kevin Hildebrand
Director of Research Technology and HPC Services
Division of IT