On Mon, 2024-07-29 at 11:23:12 +0300, Slurm users wrote:
Hi there all,
We have a Dell server with 2 x Nvidia H100 GPUs, running Slurm. After the server restarts, Slurm fails unless we first run the nvidia-smi command. When we run nvidia-smi && systemctl restart slurmd && systemctl restart slurmctld, the Slurm queue starts working. Do you have any idea what causes this error and what we can do about it?
Apparently the nvidia driver doesn't get loaded on reboot? There are multiple ways to fix that: add nvidia to /etc/modules, run modprobe nvidia via an @reboot crontab entry (or even run nvidia-smi that way)...
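A rough sketch of the @reboot variant, in case it helps. The function names and the optional file argument are made up for illustration, and I am assuming that restarting slurmd afterwards is enough for it to see the GPUs, as in the manual workaround quoted above. It has to run as root.

```shell
#!/bin/sh
# Sketch only: module_loaded and ensure_nvidia are hypothetical helper names.

# Return 0 if module $1 is listed in the module list file $2.
# Defaults to /proc/modules, where each line starts with the module name.
module_loaded() {
    grep -q "^$1 " "${2:-/proc/modules}"
}

# Load the driver if it is missing, run nvidia-smi once (the step that
# made things work in the report above), then restart slurmd.
ensure_nvidia() {
    module_loaded nvidia || modprobe nvidia || return 1
    nvidia-smi >/dev/null || return 1
    systemctl restart slurmd
}
```

To use it, add a final line that calls ensure_nvidia, save the script somewhere like /usr/local/sbin/ensure-nvidia.sh (a hypothetical path), and point an @reboot entry in root's crontab at it. Note that cron does not order itself against slurmd, which is why the sketch restarts slurmd at the end.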
Best, Steffen