Taras,
FWIW, I installed that on the system used to build Slurm. Not sure whether you installed it on a compute node instead; the updated libraries need to be on the build host so that the Slurm build process picks them up.
Sorry if that was not clear.
Brian
Thank you, Brian. Unfortunately, installing the latest nvidia-cuda-toolkit did not help. "slurmd -C" still crashes.
Best regards,
Taras
From: Brian Andrus via slurm-users <slurm-users@lists.schedmd.com>
Sent: Tuesday, May 20, 2025 11:22
To: slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Re: Crash in "slurmd -C" when latest NVIDIA drivers are used
I can't speak to the exact cause, but I did find that updating my CUDA toolkit fixed issues I saw with that a while back.
I install:
libnvidia-compute-570-server
nvidia-cuda-toolkit
from https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/
That ends up grabbing the latest cuda bits which support the newer drivers.
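To confirm which versions actually ended up on the build host, something like the following could be used. This is only a sketch: the function name show_nvidia_versions is made up for illustration, and it assumes a Debian/Ubuntu system with dpkg-query available.

```shell
#!/usr/bin/env bash
# Sketch only: list the installed NVIDIA packages named above and the nvcc release.
# show_nvidia_versions is a hypothetical helper name, not part of any package.
show_nvidia_versions() {
    # Package versions as recorded by dpkg (no output if nothing matches).
    dpkg-query -W -f='${Package} ${Version}\n' \
        'libnvidia-compute-*' nvidia-cuda-toolkit 2>/dev/null

    # Compiler release line, if nvcc is on PATH.
    if command -v nvcc >/dev/null 2>&1; then
        nvcc --version | tail -n 1
    fi
}
```

Running show_nvidia_versions on both the build host and a compute node makes it easy to spot a mismatch between the two.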
Brian Andrus
On 5/19/2025 12:50 PM, Taras Shapovalov via slurm-users wrote:
Hello,
Does anyone have an idea why "slurmd -C" crashes when it unloads gpu_nrt.so with the latest NVIDIA drivers (570 and 575)? We checked: there is no crash in CUDA itself, gpu_nvml.so works fine, all NVML calls finish successfully, and dlclose on gpu_nvml.so works fine. The crash does not depend on whether real GPUs are present.
Steps to reproduce:
- Install Ubuntu 24.04
- wget https://download.schedmd.com/slurm/slurm-24.11.4.tar.bz2
- tar fx ./slurm-24.11.4.tar.bz2
- cd slurm-24.11.4
- apt-get install cuda-12-8 hwloc libmunge-dev -y
- ./configure
- make && make install
- Run "slurmd -C", or sometimes "slurmd -vvv -C", to get the crash.
Stack trace:
#0 0x0000155555544b2a strlen (ld-linux-x86-64.so.2 + 0x28b2a)
#1 0x000015555551fc08 __GI__dl_exception_create (ld-linux-x86-64.so.2 + 0x3c08)
#2 0x000015555551d298 __GI__dl_signal_error (ld-linux-x86-64.so.2 + 0x1298)
#3 0x000015555551e81d _dl_close (ld-linux-x86-64.so.2 + 0x281d)
#4 0x000015555551d51c __GI__dl_catch_exception (ld-linux-x86-64.so.2 + 0x151c)
#5 0x000015555551d669 _dl_catch_error (ld-linux-x86-64.so.2 + 0x1669)
#6 0x0000155554e97c73 _dlerror_run (libc.so.6 + 0x97c73)
#7 0x0000155554e979a6 __dlclose (libc.so.6 + 0x979a6)
#8 0x0000155555388a25 gpu_plugin_fini (libslurmfull.so + 0x188a25)
#9 0x000015555538f2ef gres_get_autodetected_gpus (libslurmfull.so + 0x18f2ef)
#10 0x0000555555564828 _print_config (slurmd + 0x10828)
#11 0x0000155554e2a1ca __libc_start_call_main (libc.so.6 + 0x2a1ca)
#12 0x0000155554e2a28b __libc_start_main_impl (libc.so.6 + 0x2a28b)
#13 0x000055555555fc75 _start (slurmd + 0xbc75)
I don't really think the problem is in gpu_nrt itself; it looks like memory corruption somewhere else, but I am not sure. The issue reproduces consistently.
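One possible way to narrow this down, sketched below, is to capture a backtrace under gdb, log what the dynamic loader opens and closes, and run the same command under valgrind to look for invalid writes before the dlclose call. This assumes gdb and valgrind are installed and slurmd is on PATH; the function names and log file names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: gather more detail about the dlclose() crash in "slurmd -C".
# Function names and output file names are hypothetical.

# Run under gdb and print a backtrace at the point of the crash.
bt_slurmd() {
    gdb -batch -ex run -ex bt --args slurmd -C
}

# Have the glibc loader log each shared object it loads and unloads.
# Output goes to ./ld-debug.<pid>.
trace_loader() {
    LD_DEBUG=files LD_DEBUG_OUTPUT=./ld-debug slurmd -C
}

# Look for invalid reads/writes that could point to earlier memory corruption.
memcheck_slurmd() {
    valgrind --track-origins=yes --log-file=./slurmd-valgrind.log slurmd -C
}

if command -v slurmd >/dev/null 2>&1; then
    bt_slurmd
    trace_loader
    memcheck_slurmd
else
    echo "slurmd not found on PATH; nothing to run"
fi
```

If valgrind reports an invalid write well before the crash, that would support the memory-corruption theory rather than a bug in the plugin being unloaded.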
Best regards,
Taras