Hi there all,
We have a Dell server with 2 x Nvidia H100 GPUs running Slurm. After the server restarts, Slurm fails unless we first run nvidia-smi. Once we run nvidia-smi && systemctl restart slurmd && systemctl restart slurmctld, the Slurm queue starts. Do you have any idea what causes this error and what we can do about it?
On Mon, 2024-07-29 at 11:23:12 +0300, Slurm users wrote:
Apparently the nvidia driver doesn't get loaded on reboot? There are multiple ways to fix that: add the module to /etc/modules, run modprobe nvidia via a @reboot crontab entry (or even run nvidia-smi that way)...
Best, Steffen
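A minimal sketch of the first option, in case it helps: a small shell function that adds a module name to a modules list file only if it is not already listed. The function name ensure_module_listed is made up for illustration, and whether your distribution reads /etc/modules at boot is an assumption (many systemd-based distributions read *.conf files under /etc/modules-load.d/ instead).

```shell
#!/bin/sh
# Hypothetical helper: list a kernel module in a boot-time modules file.
# Usage: ensure_module_listed MODULE [FILE]
# FILE defaults to /etc/modules (an assumption; adjust for your distribution).
ensure_module_listed() {
    module=$1
    file=${2:-/etc/modules}

    # Make sure the file exists so grep does not complain.
    touch "$file" || return 1

    # -x matches whole lines, -F treats the name as a fixed string.
    if ! grep -qxF "$module" "$file"; then
        printf '%s\n' "$module" >> "$file"
    fi
}
```

Run as root, for example ensure_module_listed nvidia, or ensure_module_listed nvidia /etc/modules-load.d/nvidia.conf. The crontab variant would be a line like "@reboot /sbin/modprobe nvidia" in root's crontab; the path to modprobe is an assumption, so check it with "command -v modprobe" first.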
nvidia-persistenced is something that gets installed with the nvidia driver. Setting it to start at boot time helps slurmd find the GPUs when it tries to start. This is just one web page that has some information about it.
https://download.nvidia.com/XFree86/Linux-x86_64/396.51/README/nvidia-persistenced.html
Jeff
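For reference, enabling the daemon at boot could look roughly like this on a systemd machine. The unit name nvidia-persistenced.service is an assumption (it depends on how the driver was packaged), and the SYSTEMCTL variable and the function name are hypothetical, there only so the command can be dry-run.

```shell
#!/bin/sh
# Command used to talk to systemd; override it (e.g. SYSTEMCTL=echo) for a dry run.
SYSTEMCTL=${SYSTEMCTL:-systemctl}

# Hypothetical helper: enable the persistence daemon for future boots and start it now.
# Usage: enable_gpu_persistence [UNIT]
# UNIT defaults to nvidia-persistenced.service (assumed name, check your packaging).
enable_gpu_persistence() {
    unit=${1:-nvidia-persistenced.service}
    # "enable --now" enables the unit at boot and starts it immediately.
    $SYSTEMCTL enable --now "$unit"
}
```

Calling enable_gpu_persistence as root enables and starts the unit; with SYSTEMCTL=echo it only prints the arguments that would be passed to systemctl.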
________________________________
From: Aziz Ogutlu via slurm-users <slurm-users@lists.schedmd.com>
Sent: Monday, July 29, 2024 3:23 AM
To: slurm-users@schedmd.com
Subject: [slurm-users] Slurm fails before nvidia-smi command
After adding the nvidia-persistenced service, Slurm no longer fails.
Thanks for your help.
On 7/29/24 13:00, Sarlo, Jeffrey S wrote:
It sounds to me perhaps as though your systemd units are starting in the wrong order, or don’t have appropriate dependencies set in them?
Tim
-- Tim Cutts Scientific Computing Platform Lead AstraZeneca
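If ordering is the cause, one way to express the dependency is a drop-in for slurmd.service. This is only a sketch: the function name, the drop-in file name 10-wait-for-gpu.conf, and the choice of nvidia-persistenced.service as the unit to wait for are assumptions, not something confirmed in this thread.

```shell
#!/bin/sh
# Hypothetical helper: write a systemd drop-in that orders slurmd after a GPU unit.
# Usage: write_slurmd_dropin [DROPIN_DIR] [GPU_UNIT]
# Both defaults are assumptions; adjust them to your installation.
write_slurmd_dropin() {
    dropin_dir=${1:-/etc/systemd/system/slurmd.service.d}
    gpu_unit=${2:-nvidia-persistenced.service}

    mkdir -p "$dropin_dir" || return 1

    # The heredoc body is the drop-in file itself; $gpu_unit is expanded by the shell.
    cat > "$dropin_dir/10-wait-for-gpu.conf" <<EOF
[Unit]
# Start slurmd only after the GPU unit has been started.
After=$gpu_unit
# Pull the GPU unit in when slurmd starts, without failing slurmd if it is missing.
Wants=$gpu_unit
EOF
}
```

After writing the file as root, run systemctl daemon-reload so systemd picks it up; systemctl cat slurmd then shows the unit together with the drop-in.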