Slurm and NVIDIA NVML

Matthias Leopold

13 Nov 2024

11:19 a.m.

Hi,

I'm trying to compile Slurm with NVIDIA NVML support, but the result is unexpected. I get /usr/lib/x86_64-linux-gnu/slurm/gpu_nvml.so, but when I do "ldd /usr/lib/x86_64-linux-gnu/slurm/gpu_nvml.so" there is no reference to /lib/x86_64-linux-gnu/libnvidia-ml.so.1 (which I would expect).

~$ ldd /usr/lib/x86_64-linux-gnu/slurm/gpu_nvml.so
        linux-vdso.so.1 (0x00007ffd9a3f4000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f0bc2c06000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f0bc2e47000)

/lib/x86_64-linux-gnu/libnvidia-ml.so.1 is present during compilation. I can also see in config.status that the NVML headers were found (otherwise, to my understanding, I wouldn't get gpu_nvml.so at all).

Our old cluster was deployed with NVIDIA deepops (which compiles Slurm on every node) and also has NVML support. There, ldd gives the expected result:

~$ ldd /usr/local/lib/slurm/gpu_nvml.so
...
libnvidia-ml.so.1 => /lib/x86_64-linux-gnu/libnvidia-ml.so.1 (0x00007f3b10120000)
...

I can't test actual functionality with my new binaries because I don't have a node with GPUs yet.

Am I missing something?

Thank you,
Matthias

Joshua Randall

13 Nov 2024

1:58 p.m.

Hi Matthias,

Just another user here, but we noticed similar behaviour on our cluster with NVIDIA GPU nodes. For this cluster, we built Slurm 24.05.1 deb packages from source ourselves on Ubuntu 22.04 with the `libnvidia-ml-dev` package installed directly from the Ubuntu package archive (using the mk-build-deps / debuild method described here: https://slurm.schedmd.com/quickstart_admin.html#debuild).

In our cluster, the dynamic object dependencies of the gpu_nvml.so shared object look the same as yours (they do not show a dependency on /lib/x86_64-linux-gnu/libnvidia-ml.so.1, though we do have it available):
```
ubuntu@gpu0:~$ ldd /usr/lib/x86_64-linux-gnu/slurm/gpu_nvml.so
        linux-vdso.so.1 (0x00007ffe8c3b4000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f76301a9000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f76303ef000)
```

However, NVML autodetection is working:
```
ubuntu@gpu0:~$ sudo grep nvml /var/log/slurm/slurmd.log | tail -n 1
[2024-11-05T16:09:06.359] gpu/nvml: _get_system_gpu_list_nvml: 8 GPU system device(s) detected
```

I can also confirm that NVML library functions are being referenced from gpu_nvml.so (but are undefined therein):
```
ubuntu@gpu0:~$ objdump -T /usr/lib/x86_64-linux-gnu/slurm/gpu_nvml.so | grep nvmlInit_v2
0000000000000000      D  *UND*  0000000000000000  Base        nvmlInit_v2
```
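For anyone who wants to check every NVML symbol rather than just nvmlInit_v2, the same test can be scripted. The following is only a sketch: the function name `undefined_symbols` is made up for illustration, it assumes the whitespace-separated `objdump -T` layout shown above (symbol name in the last column), and the second sample line is a hypothetical non-NVML entry added for contrast.

```python
from __future__ import annotations


def undefined_symbols(objdump_output: str, prefix: str = "nvml") -> list[str]:
    """Return symbols that `objdump -T` marks as *UND* and that start with `prefix`."""
    symbols = []
    for line in objdump_output.splitlines():
        fields = line.split()
        # An undefined dynamic symbol has "*UND*" in the section column;
        # the symbol name is assumed to be the last field on the line.
        if "*UND*" in fields and fields[-1].startswith(prefix):
            symbols.append(fields[-1])
    return symbols


# First line is taken from the output above; the second is a hypothetical
# libc entry, included only to show that non-NVML symbols are filtered out.
sample = (
    "0000000000000000      D  *UND*  0000000000000000  Base        nvmlInit_v2\n"
    "0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 printf\n"
)
print(undefined_symbols(sample))  # ['nvmlInit_v2']
```

In practice the input would be the text captured from running `objdump -T` on gpu_nvml.so. A non-empty result only shows that the plugin expects those symbols to be provided by something loaded at run time.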

It looks like at some point Slurm moved to a model where the NVML library (libnvidia-ml.so) is autodetected and dlopen'ed before the plugin needs it, so the plugin can assume the library will be preloaded if available and no longer needs a shared library dependency on it: https://github.com/SchedMD/slurm/blob/slurm-24-05-1-1/src/interfaces/gpu.c#L80-L101
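The general mechanism can be sketched outside of Slurm. This is not Slurm's code: it is a minimal Python illustration of preloading a shared library with global symbol visibility, which is what allows an object loaded later to leave those symbols undefined. The helper name `preload_shared_library` is hypothetical, and the use of RTLD_GLOBAL is my assumption about how such a preload has to be done for the symbols to be visible to a plugin.

```python
import ctypes


def preload_shared_library(candidates):
    """Try each candidate library name; return the first handle that loads, else None."""
    for name in candidates:
        try:
            # RTLD_GLOBAL makes the library's symbols available to shared
            # objects loaded afterwards, so they need no DT_NEEDED entry for it.
            return ctypes.CDLL(name, mode=ctypes.RTLD_GLOBAL)
        except OSError:
            # Library not present (or not loadable) under this name; try the next.
            continue
    return None


# Library names as they appear in this thread.
handle = preload_shared_library(["libnvidia-ml.so.1", "libnvidia-ml.so"])
print("NVML library preloaded" if handle else "NVML library not found on this system")
```

On a build host without the NVIDIA driver installed this prints the "not found" message, which is consistent with the plugin compiling fine there while only being usable on nodes where the library can be loaded.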

Cheers,

Josh.

--
Dr. Joshua C. Randall
Principal Software Engineer
Altos Labs
email: jrandall@altoslabs.com

-- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-leave@lists.schedmd.com

-- Altos Labs UK Limited | England | Company reg 13484917 Registered address: 3rd Floor 1 Ashley Road, Altrincham, Cheshire, United Kingdom, WA14 2DT

Matthias Leopold

4:37 p.m.

New subject: [EXTERN] Re: Slurm and NVIDIA NVML

Hi Josh,

Thanks for the reply, that's very helpful. I used exactly the same compilation setup as you did; I could have mentioned that. Your results give me extra confidence, so I will accept the current situation and test it as soon as I have GPUs available.

Best, Matthias

--
Medizinische Universität Wien
Matthias Leopold
IT Services & strategisches Informationsmanagement
Enterprise Technology & Infrastructure
Spitalgasse 23, 1090 Wien
T: +43 1 40160 21241
matthias.leopold@meduniwien.ac.at
https://www.meduniwien.ac.at
