[slurm-users] Issue with AMD SMT, CUDA_VISIBLE_DEVICES and MiG AutoDetect=nvml

Zachary Newell znewell at nshe.nevada.edu
Tue Feb 21 17:23:26 UTC 2023


Hi Everyone,

Has anyone seen an issue where the CUDA_VISIBLE_DEVICES environment variable is set to an integer index (0, 1, 2 or 3 for us) instead of the MIG UUID (MIG-xxx) when AMD SMT is enabled? I'm not sure whether this is a bug, but it feels like one. Certain libraries, such as PyTorch 1.13, cannot find a MIG device when CUDA_VISIBLE_DEVICES is set to an integer.
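
For anyone comparing notes, here is a minimal sketch for checking which form a job step actually received. The function name classify_visible_devices is made up for illustration, and the prefix checks assume the usual "MIG-" and "GPU-" UUID naming; it only inspects the variable and does not query Slurm or NVML.

```python
import os


def classify_visible_devices(value):
    """Label each comma-separated entry of a CUDA_VISIBLE_DEVICES string.

    Hypothetical helper: returns a list of (entry, kind) tuples, where kind
    is "index", "mig-uuid", "gpu-uuid", or "unknown".
    """
    kinds = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            # Skip empty pieces, e.g. when the variable is unset or empty.
            continue
        if entry.isdigit():
            kinds.append((entry, "index"))
        elif entry.startswith("MIG-"):
            kinds.append((entry, "mig-uuid"))
        elif entry.startswith("GPU-"):
            kinds.append((entry, "gpu-uuid"))
        else:
            kinds.append((entry, "unknown"))
    return kinds


if __name__ == "__main__":
    # Read the variable as the job step sees it; default to empty if unset.
    raw = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    for entry, kind in classify_visible_devices(raw):
        print(f"{entry}: {kind}")
```

Running this inside a job step on the same node with SMT enabled and then disabled should show whether the entries change from "mig-uuid" to "index".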

Thanks,
--
Zachary Newell
Research Computing Engineer
NSHE System Computing Services