[slurm-users] Autodetect of nvml is not working in gres.conf

Shunran Zhang szhang at ngs.gen-info.osaka-u.ac.jp
Thu Nov 30 14:54:13 UTC 2023


Hi all,

If you could share a few more details about your OS and Slurm version,
that might shed some light on this.

There is an interesting detail about the NVML package if you are using a
RHEL-like OS.
The NVML detection part of the Slurm library (/usr/lib64/slurm/gpu_nvml.so)
is linked against /lib64/libnvidia-ml.so.1, which does the actual detection.
If you do a simple NVIDIA driver installation that pulls in
nvidia-driver-NVML from the cuda-rhel8-x86_64 repository,
this package installs /lib64/libnvidia-ml.so.1 as a symlink to
/lib64/libnvidia-ml.so.<your driver version>.
In this setup the linked library is present, so the code does not crash.
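
To see what the plugin is linked against on a given node, something like the
following rough sketch could be used. The default path is the one from my
RHEL-like setup and is an assumption for any other system, and
check_nvml_link is just a hypothetical helper name, not anything shipped
with Slurm.

```shell
#!/bin/sh
# Hypothetical helper: print the libnvidia-ml entries that a shared object
# is linked against, as reported by ldd.
# The default path is an assumption taken from a RHEL-like install.
check_nvml_link() {
    plugin="${1:-/usr/lib64/slurm/gpu_nvml.so}"
    if [ ! -e "$plugin" ]; then
        echo "plugin not found: $plugin"
        return 1
    fi
    # ldd prints one line per dependency; an unresolved one shows "not found".
    ldd "$plugin" | grep 'libnvidia-ml' || {
        echo "no libnvidia-ml dependency listed for $plugin"
        return 2
    }
}
```

Calling check_nvml_link with no argument inspects the default path; pass a
different path if your Slurm plugins live elsewhere.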

However, interestingly, the package mentioned above is missing another symlink:
/lib64/libnvidia-ml.so pointing to /lib64/libnvidia-ml.so.<your driver version>.
Take a look at the following line of the Slurm source code (I just used the
master branch, but git blame says it has been there for a long time):

"""
if (!dlopen("libnvidia-ml.so", RTLD_NOW | RTLD_GLOBAL))
"""
Link to source code:
https://github.com/SchedMD/slurm/blob/master/src/interfaces/gpu.c#L100

So even though nvidia-driver-NVML is installed and the system can find the
library the plugin was linked against (libnvidia-ml.so.1), the
libnvidia-ml.so link is not provided, so the dlopen fails with file not
found, and the error message you posted follows.
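
A quick way to check which of the two names exists on a node could look like
the sketch below. The /lib64 default is from my setup and is an assumption
elsewhere, and nvml_names is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: report whether the unversioned and the .so.1 NVML
# library names exist in a directory (default /lib64 is an assumption).
nvml_names() {
    libdir="${1:-/lib64}"
    for name in libnvidia-ml.so libnvidia-ml.so.1; do
        # -e follows symlinks, so a dangling link is reported as missing.
        if [ -e "$libdir/$name" ]; then
            echo "$name: present"
        else
            echo "$name: missing"
        fi
    done
}
```

On a node with the problem described here I would expect libnvidia-ml.so to
be reported as missing while libnvidia-ml.so.1 is present.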

In our case, I just created the missing symlink manually with
ln -s /lib64/libnvidia-ml.so.1 /lib64/libnvidia-ml.so, and NVML worked as
expected.
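
If you want to do the same on several nodes, a slightly more careful version
of that ln -s could be sketched as below. It only creates the link when it
is absent and the .so.1 target exists. ensure_nvml_symlink is a hypothetical
helper name, /lib64 is an assumption from my setup, and writing there needs
root.

```shell
#!/bin/sh
# Hypothetical helper: create libnvidia-ml.so -> libnvidia-ml.so.1 in a
# directory, but only when the target exists and the link is absent.
ensure_nvml_symlink() {
    libdir="${1:-/lib64}"
    target="$libdir/libnvidia-ml.so.1"
    link="$libdir/libnvidia-ml.so"
    if [ ! -e "$target" ]; then
        echo "missing $target; install the NVML package first" >&2
        return 1
    fi
    if [ -L "$link" ] && [ ! -e "$link" ]; then
        # A symlink exists but points nowhere; leave it for manual review.
        echo "$link is a dangling symlink; fix it by hand" >&2
        return 2
    fi
    if [ -e "$link" ]; then
        echo "$link already present"
        return 0
    fi
    ln -s "$target" "$link" && echo "created $link -> $target"
}
```

Running ensure_nvml_symlink a second time is a no-op, so it should be safe
to put in a node setup script.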

I kind of wonder whether this comes from a packaging issue on the
NVIDIA side, or whether it should be filed as a bug in the Slurm code, which
only checks for the .so library without any version suffix.

Your case might be different, but since I think the error message is a direct
result of Slurm being unable to find /lib64/libnvidia-ml.so, you should take
a look at your setup to see whether that .so file is there. If the NVML
package is not installed, install it; if it is installed but the unversioned
link is missing, create the symlink.

Sincerely,
S. Zhang

On Thu, Nov 30, 2023 at 23:23, Ravi Konila <ravibhatk at gmail.com> wrote:

> Hello,
>
> My gres.conf has AutoDetect=nvml
> when I restart slurmd service I do get
>
> *fatal: We were configured to autodetect nvml functionality, but we
> weren't able to find that lib when Slurm was configured.*
>
> I referred to a few links and the slurm-users email archives to solve this,
> but could not understand much.
>
> Can someone help me with this one? I am using a DGX A100 server which has
> 4 A100 80GB GPUs.
>
> With Warm Regards
> Ravi Konila
>