nvidia-persistenced is a daemon that gets installed with the NVIDIA driver.  It keeps the driver state initialized even when nothing is using the GPUs, so setting it to start at boot time helps slurmd find the GPUs when it tries to start.  The page below has some more information about it.

https://download.nvidia.com/XFree86/Linux-x86_64/396.51/README/nvidia-persistenced.html
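
For what it's worth, here is a rough sketch of how that could look with systemd. I am assuming the driver package provides a unit called nvidia-persistenced.service (check with "systemctl list-unit-files | grep nvidia"), and the drop-in file name is just one I made up. Besides enabling the daemon with "systemctl enable --now nvidia-persistenced", a drop-in like this should make slurmd wait for it:

```ini
# Hypothetical drop-in: /etc/systemd/system/slurmd.service.d/nvidia-persistenced.conf
[Unit]
# Assumption: the unit is named nvidia-persistenced.service on this system.
# Wants= pulls the daemon in when slurmd starts; After= orders slurmd behind it.
Wants=nvidia-persistenced.service
After=nvidia-persistenced.service
```

After creating the file, run "systemctl daemon-reload" so systemd picks it up. I have not tested this on your setup, so treat it as a starting point.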

Jeff


From: Aziz Ogutlu via slurm-users <slurm-users@lists.schedmd.com>
Sent: Monday, July 29, 2024 3:23 AM
To: slurm-users@schedmd.com <slurm-users@schedmd.com>
Subject: [slurm-users] Slurm fails before nvidia-smi command
 
Hi there all,

We have a Dell server with 2 x Nvidia H100 GPUs running Slurm on it.
After the server restarts, Slurm fails unless we first run the
nvidia-smi command. Once we run nvidia-smi && systemctl restart slurmd
&& systemctl restart slurmctld, the Slurm queue starts working. Do you
have any idea what causes this error and what we can do about it?

--
Best regards,
Aziz Öğütlü

Eduline Bilişim Sanayi ve Ticaret Ltd. Şti.  http://www.eduline.com.tr
Merkez Mah. Ayazma Cad. No:37 Papirus Plaza
Kat:6 Ofis No:118 Kağıthane -  İstanbul - Türkiye 34406
Tel : +90 212 324 60 61     Cep: +90 541 350 40 72


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-leave@lists.schedmd.com