[slurm-users] MIG-Slice: Unavailable GRES

Vogt, Timon timon.vogt at gwdg.de
Wed Jul 19 19:08:24 UTC 2023


Dear Slurm Mailing List,

I am experiencing a problem which affects our cluster and for which I am 
completely out of ideas by now, so I would like to ask the community for 
hints or ideas.

We run a partition on our cluster containing multiple nodes with Nvidia 
A100 GPUs (40GB), which we have sliced up using Nvidia Multi-Instance 
GPU (MIG) into one 3g.20gb slice and two 2g.10gb slices per GPU.

When I submit a job to this partition and request the 3g.20gb slice (for 
example with "srun -p mig-partition -G 3g.20gb:1 hostname"), the job runs 
fine. However, when a job requests one of the 2g.10gb slices instead (for 
example with "srun -p mig-partition -G 2g.10gb:1 hostname"), it does not 
get scheduled and the controller repeatedly logs the following errors:

slurmctld[28945]: error: _set_job_bits1: job 4780824 failed to find any available GRES on node 1471
slurmctld[28945]: error: gres_select_filter_select_and_set job 4780824 failed to satisfy gres-per-job counter

Our cluster uses the AutoDetect=nvml feature for the nodes in gres.conf, 
and both slice types are defined in "AccountingStorageTRES" and in the 
Gres parameter of the node definition. The slurmd on the node also finds 
both types of slices and reports the correct counts. They are also 
visible in the "Gres=" section of "scontrol show node", again with the 
correct counts.
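
For illustration, the count check described above can be sketched in 
Python. This is only a minimal sketch under stated assumptions: the 
sample "Gres=" string, the node size of four GPUs and the helper names 
gres_counts and matches_layout are hypothetical and not taken from the 
actual cluster; only the slicing of one 3g.20gb and two 2g.10gb slices 
per GPU comes from the description above.

```python
import re
from collections import Counter

# Hypothetical sample line. Real input would be the "Gres=" line printed
# by `scontrol show node <nodename>`; the exact format may differ.
SAMPLE = "   Gres=gpu:3g.20gb:4(S:0-1),gpu:2g.10gb:8(S:0-1)"

# Slicing described in this post: per physical GPU, one 3g.20gb slice
# and two 2g.10gb slices.
PER_GPU = {"3g.20gb": 1, "2g.10gb": 2}


def gres_counts(line):
    """Return a Counter mapping typed GPU GRES name -> count."""
    counts = Counter()
    # Match entries of the form gpu:<type>:<count>; untyped entries such
    # as "gpu:4" are ignored because they carry no slice type.
    for gres_type, count in re.findall(r"gpu:([^:,()\s]+):(\d+)", line):
        counts[gres_type] += int(count)
    return counts


def matches_layout(counts, n_gpus):
    """Check that every slice type appears PER_GPU[type] * n_gpus times."""
    return all(counts.get(t, 0) == k * n_gpus for t, k in PER_GPU.items())


if __name__ == "__main__":
    counts = gres_counts(SAMPLE)
    print(dict(counts), matches_layout(counts, n_gpus=4))
```

A check like this only confirms that the node advertises the expected 
number of slices of each type; it says nothing about why the controller 
fails to select one of them.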

I have also ensured that the nodes are not used otherwise by creating a 
reservation on them that is accessible only to me, and I have restarted 
all slurmd daemons and the slurmctld.

By now, I am out of ideas. Does anyone here have a suggestion on what 
else I can try? Has anyone seen this error before and knows more about it?

Thank you very much in advance and
best regards,
Timon

-- 
Timon Vogt
Arbeitsgruppe "Computing"
Nationales Hochleistungsrechnen (NHR)
Scientific Employee NHR
Tel.: +49 551 39-30146, E-Mail: timon.vogt at gwdg.de
