Scheduling issues with multiple different types of GPU in one partition


Kevin M. Hildebrand

11 Sep 2025

7:23 p.m.

We have several different types of GPUs in the same 'gpu' partition. The problem we're having occurs when one of those types of GPUs is fully occupied and there are a bunch of queued jobs waiting for those GPUs. If someone requests idle GPUs of a different type, those jobs end up getting stalled, even though there are plenty of GPUs available.

For example, say we have 10 A100 GPUs and 10 H100 GPUs. If there are 10 H100 GPU jobs running and more in queue waiting for them, subsequently submitted A100 jobs will sit in queue even if there are plenty of idle A100 GPUs. The only way we can get the A100 jobs to run is by manually bumping their priority higher than the pending H100 jobs.

Has anyone else encountered this issue? The only way we can think of to potentially solve it is to have separate partitions for each GPU type, but that seems unwieldy.

We are currently running Slurm 24.05.8.

Thanks, Kevin

-- Kevin Hildebrand Director of Research Technology and HPC Services Division of IT


Paul Edmon

11 Sep

7:34 p.m.

Yes, we've seen the same thing with mosaic/heterogeneous partitions. Our solution is to split based on hardware type.

Having a bunch of partitions may seem unwieldy, but the scheduler can handle it. For instance, we have 110 partitions and the scheduler handles them fine (most of those are hardware owned by specific groups, not public partitions everyone can see). We've adopted the convention of naming our partitions after the hardware type. For instance, we have a gpu partition (our A100s) and a gpu_h200 partition, which makes it easy for people to identify the hardware. People who can use both leverage multi-partition submission, à la #SBATCH -p gpu,gpu_h200.
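As a minimal sketch of such a multi-partition submission, a batch script could look like the following. The partition names gpu and gpu_h200 are the ones mentioned above; the job name, GPU count, and time limit are placeholders chosen for illustration. When several partitions are listed, Slurm starts the job in whichever one can run it earliest.

```shell
#!/bin/bash
# Placeholder job name.
#SBATCH --job-name=any-gpu
# Eligible for either partition; the one that can start the job first is used.
#SBATCH --partition=gpu,gpu_h200
# One GPU of whatever type the chosen partition holds (placeholder count).
#SBATCH --gres=gpu:1
# Placeholder time limit.
#SBATCH --time=01:00:00

# SLURM_JOB_PARTITION is set by Slurm inside the job; the fallback value
# only applies when the script is run outside of Slurm.
partition="${SLURM_JOB_PARTITION:-unknown}"
echo "Running in partition: ${partition}"
```

The echo at the end is only there so the job log records which partition was picked.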

I don't know of a good solution if you want to keep the mosaic partition, as it really requires your users to think at a higher level and realize there is vacant hardware that could be used if they just selected a different GPU type. Having a separate partition makes it much easier to see.

-Paul Edmon-


Cutts, Tim

17 Sep

3:31 p.m.

We have heterogeneous partitions too. We see this occasionally, but it’s not a huge problem. The way we have things set up is that all the nodes are shared by three partitions: short-gpu, medium-gpu and long-gpu. The difference between the partitions is the priority and the partition QoS. Short-gpu has the highest priority and allows the highest proportion of the GPUs to be used by a single user, but has a short maximum time limit for jobs (2 hours). Conversely, long-gpu doesn’t let the user use many GPUs, but they can run for a long time. Medium-gpu, obviously, is somewhere between the two. This seems to work reasonably well, and I can usually get a GPU for a short job almost immediately.

I would check your priority weights. If you have job age dominating the priority calculation, you’re likely to have issues where young jobs don’t run, even if they fit, resulting in the situation you see. We try to set priority so that fair share dominates while jobs are young, and it’s only if they’ve been pending for a long time that age really starts to overtake fair share. We also set the QoS priority weight very high, so that really critical jobs go straight to the top of the queue, but those QoSes are always tightly constrained to a very small number of resources (we have a ‘priority’ QoS, but it only allows the user to consume 16 CPUs and a single GPU).
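One way to see which factor dominates, sketched here with standard Slurm client tools, is to compare the configured weights against the per-job breakdown that sprio reports. The guard only skips the commands on a machine without Slurm installed; no particular weight values are assumed.

```shell
# Inspect priority weights and per-job priority factors, if Slurm is installed.
if command -v sprio >/dev/null 2>&1; then
    # Configured weight for each priority factor (age, fair share, QoS, ...)
    sprio --weights
    # Contribution of each factor to the priority of every pending job
    sprio --long
    # Related slurm.conf settings whose names start with "Priority"
    scontrol show config | grep -i '^Priority'
    checked=yes
else
    echo "sprio not found; run this on a Slurm login or controller node" >&2
    checked=no
fi
```

If the AGE column in the sprio --long output is consistently the largest term for the jobs at the top of the queue, age is probably outweighing fair share.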

I have to say, I find this to be an area where SLURM is a bit weaker than some other schedulers. It’s very difficult, sometimes, to really understand why a particular job isn’t running. I used to be an LSF administrator, and I really loved the ‘bjobs -l -p’ command in LSF, which tells you exactly why a job cannot be run on each node, and the answer can be different in each case.

Tim


Ryan Novosielski

11 Sep

7:38 p.m.

Are you saying these are jobs that should be able to run right now but they’re just not getting considered, or there’s something that’s wrong about the way they’re submitted that has to be manually corrected to allow them to run on A100s?

If the former, it sounds like your backfill settings just might be inadequate to allow it to consider jobs far enough down the list.

-- #BlackLivesMatter Ryan Novosielski (he/him) - novosirj@rutgers.edu, Sr. Technologist - 973/972.0922 (2x0922), Office of Advanced Research Computing - MSB A555B, Newark, RBHS Campus, Rutgers, the State University of NJ


Kevin M. Hildebrand

8:06 p.m.

The former: jobs should run but are not. We currently have these backfill parameters set: bf_continue,bf_max_job_user=10. bf_max_job_test is at its default of 500. However, sdiag says the number of times bf_max_job_test has been hit is zero, so that's probably not relevant. I can try removing bf_max_job_user, but I don't think that's the issue either, as this problem also seems to affect users with few jobs in the queue when a different user has consumed all of one GPU type.
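For anyone wanting to check the same things on their own cluster, a rough sketch of the commands involved is below. It only reads state and changes nothing; the sed range simply prints sdiag output from the backfill section onward, and the exact counters shown there vary between Slurm versions.

```shell
# Read-only check of backfill settings and statistics, if Slurm is installed.
if command -v scontrol >/dev/null 2>&1; then
    # Active scheduler options, e.g. bf_continue or bf_max_job_user
    scontrol show config | grep -i 'SchedulerParameters'
    # Scheduler statistics from the backfill section to the end of the output
    sdiag | sed -n '/Backfilling stats/,$p'
    inspected=yes
else
    echo "scontrol not found; run this on a Slurm login or controller node" >&2
    inspected=no
fi
```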

Kevin


Gerhard Strangar

12 Sep

4:37 a.m.

Kevin M. Hildebrand via slurm-users wrote:

...

bf_max_job_test is the default of 500. However sdiag says the number of times bf_max_job_test has been hit is zero, so that's probably not relevant.

Do you have weights defined? I've seen the scheduler insist on waiting for nodes with lower weights when higher-weighted ones were idle. Running squeue --start will tell.
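A small sketch of that check, assuming the partition is called gpu as in the original post and using a made-up job ID as a placeholder:

```shell
# Placeholder job ID; pass the ID of a stalled job as the first argument.
job_id="${1:-12345}"

if command -v squeue >/dev/null 2>&1; then
    # Expected start time and the nodes the scheduler plans to use for the job
    squeue --start --jobs="$job_id"
    # Node name, scheduling weight, and GRES for each node in the partition
    sinfo --partition=gpu --Node --format="%N %w %G"
else
    echo "squeue not found; run this on a Slurm login node" >&2
fi
```

If the planned nodes are busy low-weight nodes while idle higher-weight nodes of the requested GPU type exist, node weights are a likely suspect.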

Gerhard

Bjørn-Helge Mevik

7:09 a.m.

"Kevin M. Hildebrand via slurm-users" slurm-users@lists.schedmd.com writes:

...

The former- jobs should run but are not. We currently have these backfill parameters set: bf_continue,bf_max_job_user=10. bf_max_job_test is the default of 500. However sdiag says the number of times bf_max_job_test has been hit is zero, so that's probably not relevant. I can try removing bf_max_job_user, but I don't think that's the issue either, as this problem also seems to affect users with few jobs in queue when a different user has all of one GPU type consumed.

Perhaps you can add more debugging in slurmctld, for instance DebugFlags=Backfill,SelectType (and possibly Gres) and increase SlurmctldDebug to debug2 or debug3. Then you might see *why* it doesn't schedule the jobs.

-- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

Christopher Samuel

3 p.m.

On 9/12/25 3:09 am, Bjørn-Helge Mevik via slurm-users wrote:

...

Perhaps you can add more debugging in slurmctld, for instance DebugFlags=Backfill,SelectType (and possibly Gres) and increase SlurmctldDebug to debug2 or debug3.

Both of these can be changed on the fly with scontrol too:

https://slurm.schedmd.com/scontrol.html#OPT_setdebug

https://slurm.schedmd.com/scontrol.html#OPT_setdebugflags

FWIW we run all the time at debug and with debugflags=backfill.
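As a sketch, the on-the-fly changes could look like this. It needs Slurm administrator privileges, and the final revert to info is an assumption about the previous log level, so substitute whatever level the site normally runs at.

```shell
# Log level to use while investigating (debug2 or debug3, as suggested above).
debug_level="debug2"

if command -v scontrol >/dev/null 2>&1; then
    # Raise the slurmctld log level without restarting the daemon
    scontrol setdebug "$debug_level"
    # Turn on extra output from the backfill scheduler and the select plugin
    scontrol setdebugflags +Backfill
    scontrol setdebugflags +SelectType

    # ... reproduce the stalled job and read the slurmctld log here ...

    # Revert afterwards (assumes the previous level was info)
    scontrol setdebugflags -SelectType
    scontrol setdebugflags -Backfill
    scontrol setdebug info
else
    echo "scontrol not found; run this on the Slurm controller" >&2
fi
```

Changes made this way are not written to slurm.conf, so the configured values apply again after slurmctld restarts.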

All the best, Chris

-- Chris Samuel : http://www.csamuel.org/ : Philadelphia, PA, USA
