[slurm-users] Re: Dynamic MIG Question

6 Feb 2026

I actually spent a bit of time in the SLURM booth at SC discussing this
(and also frequently hanging out in their comfy chairs - easy times on the
bad hip).
This is on the back burner for us. The basic problem is that SLURM doesn't
have a mechanism to drain a GPU; rather, the entire node has to be drained
to make changes. That's the easy description of the problem. There may be
ways to do it within the current capabilities of SLURM, but we haven't
picked up that effort in earnest, yet...
We do find occasional issues with NVML control when reconfiguring MIG on
multi-GPU systems, where our scripts sometimes fail on one or more GPUs,
and they need to be manually reconfigured after that (but that's something
in the NVIDIA driver, presumably). We either run nodes un-MIGed or split
into two nominally equal slices. We're currently defaulting to using MIG on
half of our B200s and all of our Max-Qs.
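For illustration only, a per-GPU retry wrapper around the reconfiguration step might look like the sketch below. It is not the script used at the site described above. The function names, the `NVSMI` and `RETRY_DELAY` variables, and the retry count are all assumptions made for this example; only the `nvidia-smi mig` options (`-i`, `-dci`, `-dgi`, `-cgi`, `-C`) are taken from the script later in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: apply one MIG layout to several GPUs, retrying each
# GPU a few times and reporting the ones that still fail.
# NVSMI is an assumed override hook so a stand-in can replace nvidia-smi.
NVSMI="${NVSMI:-nvidia-smi}"

# apply_mig GPU PROFILES [TRIES]
# Returns 0 once the create command succeeds, 1 after TRIES failed attempts.
apply_mig() {
    local gpu="$1" profiles="$2" tries="${3:-3}" attempt
    for ((attempt = 1; attempt <= tries; attempt++)); do
        # Clear leftover compute and GPU instances first. Errors are ignored
        # here because there may be nothing to destroy.
        "$NVSMI" mig -i "$gpu" -dci >/dev/null 2>&1
        "$NVSMI" mig -i "$gpu" -dgi >/dev/null 2>&1
        # Create the GPU instances and their default compute instances.
        if "$NVSMI" mig -i "$gpu" -cgi "$profiles" -C >/dev/null 2>&1; then
            return 0
        fi
        sleep "${RETRY_DELAY:-2}"
    done
    return 1
}

# apply_mig_all PROFILES GPU...
# Prints the indices of GPUs that could not be configured and returns 1
# if there were any.
apply_mig_all() {
    local profiles="$1" gpu failed=()
    shift
    for gpu in "$@"; do
        apply_mig "$gpu" "$profiles" || failed+=("$gpu")
    done
    if ((${#failed[@]})); then
        echo "failed GPUs: ${failed[*]}"
        return 1
    fi
}
```

A call such as `apply_mig_all 19,14,5 0 1` would try the profile list from the script in this thread on GPUs 0 and 1. GPUs listed as failed would still need the manual attention described above.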
Being able to drain a single GPU would obviously be great.
On Fri, Feb 6, 2026 at 5:07 PM Davide DelVento via slurm-users <
slurm-users@lists.schedmd.com> wrote:
...
Aaron (or anyone else),
Did you manage to get Dynamic MIG working in Slurm? I'm actually surprised
that after so many years SchedMD has not implemented this feature yet,
especially now that newer GPUs allow MIG repartitioning without being root.
The only mention of this in their ticketing system is at
https://support.schedmd.com/show_bug.cgi?id=11091#c8 (and subsequent c10)
which say that it's not on their roadmap, but that was 5 years ago.
I have heard that some users manage dynamic changes by draining nodes,
running scripts to reconfigure MIG via nvidia-smi, bringing the node back
and then submitting the job. Has anybody here tried that, and with what
success?
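The drain-and-reconfigure workflow described above could be sketched roughly as follows. This is an illustration under assumptions, not a tested procedure from anyone in this thread. The helper names and the `DRY_RUN` switch are invented for the example. It assumes the script runs on the compute node itself with sufficient privileges, and that slurmd is managed as a systemd unit named `slurmd`. Updating slurm.conf or gres.conf to match the new layout is not shown.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a drain -> reconfigure MIG -> resume cycle.
# DRY_RUN=1 (the default here) only prints the commands; DRY_RUN=0 runs them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Block until sinfo reports the node as drained (no jobs left on it).
wait_drained() {
    local node="$1"
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ wait until $node is drained"
        return 0
    fi
    until sinfo -h -n "$node" -o '%T' | grep -q '^drained'; do
        sleep 10
    done
}

# reconfigure_node NODE GPU PROFILES
reconfigure_node() {
    local node="$1" gpu="$2" profiles="$3"
    # Stop new jobs from landing on the node and let running ones finish.
    run scontrol update NodeName="$node" State=DRAIN Reason="MIG reconfiguration"
    wait_drained "$node"
    # Destroy compute instances, then GPU instances, then create the new set.
    run nvidia-smi mig -i "$gpu" -dci
    run nvidia-smi mig -i "$gpu" -dgi
    run nvidia-smi mig -i "$gpu" -cgi "$profiles" -C
    # Assumption: restarting slurmd is how the node picks up the new GRES
    # layout, with the Slurm config already updated to describe it.
    run systemctl restart slurmd
    run scontrol update NodeName="$node" State=RESUME
}
```

With the defaults, `reconfigure_node gx06 1 19,14,5` only prints the commands it would run, using the node name and profile list from the script further down the thread. Setting `DRY_RUN=0` would execute them for real.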
I speculate that now that NVIDIA owns SchedMD perhaps this feature will be
at a higher priority, but maybe not? Does anybody know anything about it who
is not bound by an NDA to keep mum?
Thanks
On Wed, Nov 22, 2023 at 1:22 PM Davide DelVento davide.quantum@gmail.com
wrote:
...
I assume you mean the sentence about dynamic MIG at
https://slurm.schedmd.com/gres.html#MIG_Management
Could it be supported? I think so, but only if one of their paying
customers (that could be you) asks for it.
On Wed, Nov 22, 2023 at 11:24 AM Aaron Kollmann <
aaron.kollmann@student.hpi.de> wrote:
...
Hello All,
I am currently working on a research project and we are trying to find
out whether we can use NVIDIA's multi-instance GPU (MIG) dynamically in
SLURM.
For instance:

- a user requests a job and wants a GPU but none is available
- now SLURM will reconfigure a MIG GPU to create a partition (e.g.
  1g.5gb) which becomes available and allocated immediately

I can already reconfigure MIG + SLURM within a few seconds to start jobs
on newly partitioned resources, but jobs get killed when I restart slurmd
on nodes with a changed MIG config (see the script example below).
*Do you think it is possible to develop a plugin or change SLURM to the
extent that dynamic MIG will be supported one day?*
(The website says it is not supported)
Best

Aaron

#!/usr/bin/bash
# Generate Start Config
killall slurmd
killall slurmctld
nvidia-smi mig -dci
nvidia-smi mig -dgi
nvidia-smi mig -cgi 19,14,5 -i 0 -C
nvidia-smi mig -cgi 0 -i 1 -C
cp -f ./slurm-19145-0.conf /etc/slurm/slurm.conf
slurmd -c
slurmctld -c
sleep 5
# Start a running and a pending job (the first job gets killed by slurm)
srun -w gx06 -c 2 --mem 1G --gres=gpu:a100_1g.5gb:1 sleep 300 &
srun -w gx06 -c 2 --mem 1G --gres=gpu:a100_1g.5gb:1 sleep 300 &
sleep 5
# Simulate MIG Config Change
nvidia-smi mig -i 1 -dci
nvidia-smi mig -i 1 -dgi
nvidia-smi mig -cgi 19,14,5 -i 1 -C
cp -f ./slurm-2x19145.conf /etc/slurm/slurm.conf
killall slurmd
killall slurmctld
slurmd
slurmctld
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-leave@lists.schedmd.com
