[slurm-users] Slurm 21.08 and MIG autodetection

Taras Shapovalov tshapovalov at nvidia.com
Fri Jul 1 07:59:02 UTC 2022

Hi guys,

I am trying to double-check that what I see is expected behavior for 21.08.8-2 with A100 GPUs. Can someone please confirm the following:

  1.  According to the (very short) documentation, all I need to do to support already-partitioned MIG devices is add AutoDetect=nvml to gres.conf (we use a global gres.conf for all nodes) and then add "Gres=gpu:4" to the NodeName line in slurm.conf. In this case I get only very basic GPU detection (just the number of GPUs; according to the slurmd log it detects a lot of MIG properties, but they are not used). In scontrol for this node I see "Gres=gpu:4(S:1)", nothing more.
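If I read the man pages correctly, the minimal setup for this case looks roughly like the following (node name and GPU count are placeholders for one of our A100 nodes):

```conf
# gres.conf (global, shared by all nodes)
AutoDetect=nvml

# slurm.conf
GresTypes=gpu
NodeName=node001 Gres=gpu:4 ...
```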

  2.  If I want to allow users to request MIG profiles instead of just a number of GPUs, I need to add NodeName lines with AutoDetect=nvml and the profile names to gres.conf, add the profile names to the NodeName line in slurm.conf, e.g. "Gres=gpu:1g.5gb:3,gpu:4g.20gb:1", and also add these profile names to GresTypes. So three places to update. Only then does scontrol show me "Gres=gpu:4g.20gb:1(S:1),gpu:1g.5gb:3(S:1)" (in fact, in 21.08.8-2 I see "Gres=gpu:4g.20gb:1(S:1),gpu:1g.5gb:3(S:1),gpu:1g.5gb:1g.5gb:3,gpu:4g.20gb:4g.20gb:1", which looks like a minor bug, but seems unimportant).
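As far as I can tell, the three places end up looking something like this (my reading of the gres.conf and slurm.conf man pages; node name is a placeholder):

```conf
# gres.conf -- one line per MIG profile present on the node
NodeName=node001 AutoDetect=nvml Name=gpu Type=1g.5gb
NodeName=node001 AutoDetect=nvml Name=gpu Type=4g.20gb

# slurm.conf -- profile names repeated on the NodeName line
GresTypes=gpu
NodeName=node001 Gres=gpu:1g.5gb:3,gpu:4g.20gb:1 ...
```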

  3.  To track MIG profile usage, I also need to add all current MIG profiles (used on any node of my cluster) to AccountingStorageTRES in slurm.conf.
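For our two profiles above, that would presumably be a line like:

```conf
# slurm.conf -- profile names must match the Gres types in use
AccountingStorageTRES=gres/gpu,gres/gpu:1g.5gb,gres/gpu:4g.20gb
```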

  4.  Each time I re-partition the MIGs, I need to update the AccountingStorageTRES, NodeName, and GresTypes lines in slurm.conf, plus the NodeName lines in gres.conf, and restart all involved slurmd daemons and slurmctld.
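Concretely, if I re-partitioned node001 from the layout above to, say, two 2g.10gb instances (a hypothetical example), my understanding is that all four of these spots would have to change before restarting the daemons:

```conf
# slurm.conf
AccountingStorageTRES=gres/gpu,gres/gpu:2g.10gb
GresTypes=gpu
NodeName=node001 Gres=gpu:2g.10gb:2 ...

# gres.conf
NodeName=node001 AutoDetect=nvml Name=gpu Type=2g.10gb
```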


Is my understanding of the documentation complete, or am I missing something that would allow me to avoid updating four places in slurm.conf/gres.conf when re-partitioning the GPUs?

Best regards,

