<!DOCTYPE html>
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Hello All,</p>
<p>I am currently working on a research project, and we are trying to
find out whether we can use NVIDIA's Multi-Instance GPU (MIG)
dynamically in SLURM.</p>
<p>For instance:</p>
<p>- a user submits a job that requests a GPU, but none is available</p>
<p>- SLURM then reconfigures a MIG-capable GPU to create a new MIG
instance (e.g. 1g.5gb), which becomes available and is allocated
immediately</p>
<p>I can already reconfigure MIG and SLURM within a few seconds to
start jobs on newly partitioned resources, but running jobs get killed
when I restart slurmd on nodes with a changed MIG config (see the
example script below).</p>
<p><b>Do you think it is possible to develop a plugin or change
SLURM to the extent that dynamic MIG will be supported one day?
</b></p>
<p>(The website says it is not supported.)</p>
<p>Best</p>
<p>- Aaron</p>
<p><font size="1"><br>
#!/usr/bin/bash<br>
<br>
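# Generate the start config<br>
# Stop the SLURM daemons, then destroy all existing MIG compute
instances (-dci) and GPU instances (-dgi)<br>
killall slurmd<br>
killall slurmctld<br>
nvidia-smi mig -dci<br>
nvidia-smi mig -dgi<br>
# GPU 0: create GPU instances with profile IDs 19, 14 and 5; -C also
creates the matching compute instances<br>
nvidia-smi mig -cgi 19,14,5 -i 0 -C<br>
# GPU 1: create a single GPU instance with profile ID 0<br>
nvidia-smi mig -cgi 0 -i 1 -C<br>
cp -f ./slurm-19145-0.conf /etc/slurm/slurm.conf<br>
slurmd -c<br>
slurmctld -c<br>
sleep 5<br>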
<br>
# Start a running and a pending job (the first job gets killed by SLURM)<br>
srun -w gx06 -c 2 --mem 1G --gres=gpu:a100_1g.5gb:1 sleep 300 &amp;<br>
srun -w gx06 -c 2 --mem 1G --gres=gpu:a100_1g.5gb:1 sleep 300 &amp;<br>
sleep 5<br>
<br>
# Simulate MIG Config Change<br>
nvidia-smi mig -i 1 -dci<br>
nvidia-smi mig -i 1 -dgi<br>
nvidia-smi mig -cgi 19,14,5 -i 1 -C<br>
cp -f ./slurm-2x19145.conf /etc/slurm/slurm.conf<br>
killall slurmd<br>
killall slurmctld<br>
slurmd<br>
slurmctld</font><br>
</p>
</body>
</html>