[slurm-users] Slurm doesn't allocate job on available MIGs

Tristan Gillard tristan.gillard at centralesupelec.fr
Tue Dec 12 08:49:33 UTC 2023


Hello,

We have a problem on a DGX whose 4 A100s are split into MIG (Multi-Instance GPU) instances of different sizes.

We use Slurm to allocate jobs through partitions that group the MIGs by size:
- prod10 for 10 x 1g.10gb
- prod20 for 4 x 2g.20gb
- prod40 for 1 x 3g.40gb
- prod80 for 1 x A100g80gb

Here is an example of the problem:
1. a first job runs on prod40
2. a second job is pending on prod40, since no more 3g.40gb MIGs are available (reason: Resources)
3. a third job is pending on prod10, even though all 10 1g.10gb MIGs are available (reason: Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
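
For anyone who wants to reproduce or compare the queue state, the scenario above can be captured from squeue and summarised per partition. This is only a minimal sketch, assuming the output comes from `squeue -h -o "%i|%P|%T|%r"` (job id, partition, state, reason). The function name `summarize_pending` and the job IDs in the sample are made up for illustration and are not taken from our cluster.

```python
from collections import Counter


def summarize_pending(squeue_text):
    """Count pending jobs per (partition, reason).

    Assumes each line has the form "jobid|partition|state|reason",
    e.g. as printed by: squeue -h -o "%i|%P|%T|%r"
    """
    counts = Counter()
    for line in squeue_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first three pipes only, the reason is free text.
        job_id, partition, state, reason = line.split("|", 3)
        if state == "PENDING":
            counts[(partition, reason)] += 1
    return counts


# Illustrative sample mirroring the three jobs described above
# (job IDs are hypothetical).
SAMPLE = """\
101|prod40|RUNNING|None
102|prod40|PENDING|Resources
103|prod10|PENDING|Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions
"""

for (partition, reason), n in sorted(summarize_pending(SAMPLE).items()):
    print(f"{partition}: {n} pending ({reason})")
```

On a live system the same function could be fed the real squeue output instead of the sample text.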

We don't understand why Slurm doesn't allocate a 1g.10gb MIG to the third job, which we don't think should have to wait.

When no second job is pending on prod40, jobs submitted to prod10 start without waiting.

The slurm.conf and gres.conf files are available as attachments.

Can anyone help us solve the problem?

Have a nice day,

Tristan Gillard
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dgx_gres.conf
Type: application/octet-stream
Size: 1992 bytes
Desc: dgx_gres.conf
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20231212/f0b179a6/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dgx_slurm.conf
Type: application/octet-stream
Size: 2982 bytes
Desc: dgx_slurm.conf
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20231212/f0b179a6/attachment-0001.obj>

