[slurm-users] MIG-Slice: Unavailable GRES

Vogt, Timon timon.vogt at gwdg.de
Thu Jul 20 10:16:07 UTC 2023


Hi Rob,

Thank you very much for that hint. I tried setting the MIG slices 
manually in gres.conf and it works now.
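
In case anyone else runs into this, here is a minimal sketch of what such
a manual entry can look like. The node name and the cap minor numbers
below are made up for illustration; the real minors for each MIG instance
can typically be read from /proc/driver/nvidia-caps/mig-minors or found
via nvidia-smi, as Rob suggested:

  # gres.conf -- explicit MIG devices instead of AutoDetect=nvml (illustrative values)
  NodeName=mig-node01 Name=gpu Type=3g.20gb File=/dev/nvidia-caps/nvidia-cap[21,30,39,48]
  NodeName=mig-node01 Name=gpu Type=2g.10gb File=/dev/nvidia-caps/nvidia-cap[57,66,75,84,93,102,111,120]

The Type= strings need to match the GRES types used in the node's Gres=
definition and in AccountingStorageTRES in slurm.conf.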

Thank you very much.
Best regards,
Timon

-- 
Timon Vogt
Working Group "Computing"
Nationales Hochleistungsrechnen (NHR)
Scientific Employee NHR
Tel.: +49 551 39-30146, E-Mail: timon.vogt at gwdg.de
-------------------------------------------------------------------------
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)
Burckhardtweg 4, 37077 Göttingen, URL: https://gwdg.de

Support: Tel.: +49 551 39-30000, URL: https://gwdg.de/support
Secretariat: Tel.: +49 551 39-30001, E-Mail: gwdg at gwdg.de

Managing Director: Prof. Dr. Ramin Yahyapour
Chairman of the Supervisory Board: Prof. Dr. Christian Griesinger
Registered office: Göttingen
Register court: Göttingen, commercial register no. B 598

Certified according to ISO 9001 and ISO 27001
-------------------------------------------------------------------------

On 19.07.23 at 21:21, Groner, Rob wrote:
> At some point when we were experimenting with MIG, I was completely
> frustrated trying to get it to work until I finally removed the
> autodetect from gres.conf and explicitly listed the devices instead.
> THEN it worked.  I think you can find the list of device files using
> nvidia-smi.
>
> Here is the entry we use in our gres.conf for one of the nodes:
>
> NodeName=p-gc-3037 Name=gpu Type=1g.5gb 
> File=/dev/nvidia-caps/nvidia-cap[66,75,84,102,111,120,129,201,210,219,228,237,246,255]
>
> Something to try, anyway.  Odd that 3g.20gb works.  You might try
> reconfiguring the node that way instead and see if it works then.
> We've used 3g.20gb and 1g.5gb on our nodes and they work fine; we
> never tried 2g.10gb.
>
> Rob
>
>
> ------------------------------------------------------------------------
> *From:* slurm-users on behalf of Vogt, Timon
> *Sent:* Wednesday, July 19, 2023 3:08 PM
> *To:* slurm-users at schedmd.com
> *Subject:* [slurm-users] MIG-Slice: Unavailable GRES
>
> Dear Slurm Mailing List,
>
> I am experiencing a problem on our cluster for which I am completely
> out of ideas by now, so I would like to ask the community for hints.
>
> We run a partition on our cluster containing multiple nodes with Nvidia
> A100 GPUs (40GB), which we have sliced up using Nvidia Multi-Instance
> GPUs (MIG) into one 3g.20gb slice and two 2g.10gb slices per GPU.
>
> Now, when submitting a job to it and requesting the 3g.20gb slice (like
> with "srun -p mig-partition -G 3g.20gb:1 hostname"), the job runs fine,
> but when a job requests one of the 2g.10gb slices instead (like with
> "srun -p mig-partition -G 2g.10gb:1 hostname"), the job does not get
> scheduled and the controller repeatedly outputs the error:
>
> slurmctld[28945]: error: _set_job_bits1: job 4780824 failed to find any
> available GRES on node 1471
> slurmctld[28945]: error: gres_select_filter_select_and_set job 4780824
> failed to satisfy gres-per-job counter
>
> Our cluster uses the AutoDetect=nvml feature for the nodes in the
> gres.conf and both slice types are defined in "AccountingStorageTRES"
> and in the GRES parameter of the node definition. The slurmd on the node
> also finds both types of slices and reports the correct amounts. They
> are also visible in the "Gres=" section of "scontrol show node", again
> in correct amounts.
>
> I have also ensured that the nodes are not otherwise in use by creating
> a reservation on them accessible only to me, and I have restarted all
> slurmd daemons and the slurmctld.
>
> By now, I am out of ideas. Does someone here have a suggestion on what
> else I can try? Has anyone seen this error before and can tell me more
> about it?
>
> Thank you very much in advance and
> best regards,
> Timon
>
>
