I can't speak to the exact cause, but I did find that updating my CUDA toolkit fixed issues I saw with that a while back.
I installed:
libnvidia-compute-570-server
nvidia-cuda-toolkit
from
https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/
That ends up pulling in the latest CUDA packages, which support the newer drivers.
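For reference, the whole sequence on Ubuntu 24.04 looks roughly like this (the cuda-keyring package is NVIDIA's usual way to enable that repo; verify the exact file name against the repo index above):

  wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
  sudo dpkg -i cuda-keyring_1.1-1_all.deb
  sudo apt-get update
  sudo apt-get install -y libnvidia-compute-570-server nvidia-cuda-toolkit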
Brian Andrus
Hello,
Does someone have an idea why "slurmd -C" crashes when it unloads gpu_nrt.so with the latest NVIDIA drivers (570 and 575)? We checked: there is no crash in CUDA itself at the moment, and gpu_nvml.so works fine; all NVML calls finish successfully, and dlclose on gpu_nvml.so works fine. The crash does not depend on whether real GPUs are present.
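To separate a dlclose() problem from everything else slurmd does, a tiny standalone loader can be pointed at each plugin in turn. This is only a sketch: the plugin path below assumes the default ./configure prefix (/usr/local), and slurmd's real dlopen flags may differ from RTLD_LAZY.

  cat > dlclose_test.c <<'EOF'
  #include <dlfcn.h>
  #include <stdio.h>

  /* Load and unload a single shared object, so a crash in dlclose()
   * can be reproduced outside of slurmd. */
  int main(int argc, char **argv)
  {
      if (argc != 2) {
          fprintf(stderr, "usage: %s <plugin.so>\n", argv[0]);
          return 2;
      }
      void *h = dlopen(argv[1], RTLD_LAZY);
      if (!h) {
          fprintf(stderr, "dlopen failed: %s\n", dlerror());
          return 1;
      }
      printf("loaded %s\n", argv[1]);
      if (dlclose(h)) {
          fprintf(stderr, "dlclose failed: %s\n", dlerror());
          return 1;
      }
      printf("dlclose finished cleanly\n");
      return 0;
  }
  EOF
  gcc -o dlclose_test dlclose_test.c -ldl
  ./dlclose_test /usr/local/lib/slurm/gpu_nrt.so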
Steps to reproduce:
- Install Ubuntu 24.04
- wget https://download.schedmd.com/slurm/slurm-24.11.4.tar.bz2
- tar fx ./slurm-24.11.4.tar.bz2
- cd slurm-24.11.4
- apt-get install cuda-12-8 hwloc libmunge-dev -y
- ./configure
- make && make install
- Run "slurmd -C" (or sometimes "slurmd -vvv -C") to get the crash.
Stack trace:
#0  0x0000155555544b2a strlen (ld-linux-x86-64.so.2 + 0x28b2a)
#1  0x000015555551fc08 __GI__dl_exception_create (ld-linux-x86-64.so.2 + 0x3c08)
#2  0x000015555551d298 __GI__dl_signal_error (ld-linux-x86-64.so.2 + 0x1298)
#3  0x000015555551e81d _dl_close (ld-linux-x86-64.so.2 + 0x281d)
#4  0x000015555551d51c __GI__dl_catch_exception (ld-linux-x86-64.so.2 + 0x151c)
#5  0x000015555551d669 _dl_catch_error (ld-linux-x86-64.so.2 + 0x1669)
#6  0x0000155554e97c73 _dlerror_run (libc.so.6 + 0x97c73)
#7  0x0000155554e979a6 __dlclose (libc.so.6 + 0x979a6)
#8  0x0000155555388a25 gpu_plugin_fini (libslurmfull.so + 0x188a25)
#9  0x000015555538f2ef gres_get_autodetected_gpus (libslurmfull.so + 0x18f2ef)
#10 0x0000555555564828 _print_config (slurmd + 0x10828)
#11 0x0000155554e2a1ca __libc_start_call_main (libc.so.6 + 0x2a1ca)
#12 0x0000155554e2a28b __libc_start_main_impl (libc.so.6 + 0x2a28b)
#13 0x000055555555fc75 _start (slurmd + 0xbc75)
I don't really think the problem is in gpu_nrt itself; it seems to be memory corruption somewhere else, but I am not sure. The issue reproduces consistently.
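If it is memory corruption, running the same command under valgrind might show the first invalid write long before dlclose actually faults. A rough invocation (expect some benign-looking noise from the NVIDIA user-space libraries):

  valgrind --track-origins=yes --num-callers=30 slurmd -C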
Best regards,
Taras