[slurm-users] Building Slurm RPMs with NVIDIA GPU support?
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Tue Jan 26 20:10:23 UTC 2021
Thanks Paul!
On 26-01-2021 20:50, Paul Edmon wrote:
> In the RPM spec we use to build Slurm, we do the following additional
> things for GPUs:
>
> BuildRequires: cuda-nvml-devel-11-1
>
> then in the %build section we do:
>
> export CFLAGS="$CFLAGS \
> -L/usr/local/cuda-11.1/targets/x86_64-linux/lib/stubs/ \
> -I/usr/local/cuda-11.1/targets/x86_64-linux/include/"
>
> That ensures the CUDA libraries are installed and tells the build where
> to find them. After that, configure should detect the NVML libraries and
> link against them.
>
> I've attached our full spec that we use to build.
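A minimal sketch of the same %build step as a shell function. It only adds the flags when the CUDA NVML header is actually present, so a build host without CUDA still produces (non-NVML) RPMs instead of failing. The function name add_nvml_cflags is hypothetical, and the targets/x86_64-linux layout is simply taken from the paths quoted above; adjust both for other CUDA versions or architectures.

```shell
#!/usr/bin/env bash
# Sketch: append the CUDA NVML include/stub paths to CFLAGS, but only if
# nvml.h exists under the given CUDA root. add_nvml_cflags is a hypothetical
# helper name; the targets/<arch> layout is assumed from the CUDA 11.1 paths.
add_nvml_cflags() {
  local cuda_root="$1"
  local target="${2:-x86_64-linux}"
  local inc="$cuda_root/targets/$target/include"
  local stubs="$cuda_root/targets/$target/lib/stubs"

  if [ ! -f "$inc/nvml.h" ]; then
    echo "nvml.h not found in $inc; building without NVML support" >&2
    return 1
  fi

  # Keep any existing CFLAGS and add a separating space only when needed.
  CFLAGS="${CFLAGS:+$CFLAGS }-L$stubs/ -I$inc/"
  export CFLAGS
}
```

In the spec this could be called as `add_nvml_cflags /usr/local/cuda-11.1 || :` before %configure, so that a missing header only prints a warning rather than stopping the build.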
What I don't understand is whether it is actually *required* to make the NVIDIA
libraries available to Slurm. I didn't do that, and I'm not aware of
any problems with our GPU nodes so far. Of course, our GPU nodes have
the libraries installed and the /dev/nvidia? devices are present.
Are some of Slurm's GPU features missing or broken without the
libraries? SchedMD's slurm.spec file doesn't mention any "--with
nvidia" (or similar) build options, so I'm really puzzled.
Most of our nodes don't have GPUs, so I would rather not install the
libraries needlessly on those nodes.
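As far as I understand, Slurm only builds its gpu/nvml plugin (gpu_nvml.so) when configure finds NVML, and that plugin is what AutoDetect=nvml in gres.conf relies on. Without it, GPUs can still be scheduled as long as they are listed explicitly in gres.conf (for example with File=/dev/nvidia0). A quick way to tell which kind of build is installed is to look for the plugin file. The default directory /usr/lib64/slurm is an assumption (the usual location for RPM installs on RHEL/CentOS), and slurm_has_gpu_nvml is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: succeed if the given Slurm plugin directory contains the NVML GPU
# plugin. /usr/lib64/slurm is an ASSUMED default (typical RPM install path);
# slurm_has_gpu_nvml is a hypothetical helper, not part of Slurm.
slurm_has_gpu_nvml() {
  local plugdir="${1:-/usr/lib64/slurm}"
  [ -e "$plugdir/gpu_nvml.so" ]
}

# Example: report on the directory given as $1, or the default one.
if slurm_has_gpu_nvml "${1:-}"; then
  echo "gpu_nvml.so found: this Slurm build includes NVML support"
else
  echo "gpu_nvml.so not found: list GPUs explicitly in gres.conf"
fi
```

On a GPU node, `ldd /usr/lib64/slurm/gpu_nvml.so` would additionally show whether the plugin can resolve libnvidia-ml there.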
Thanks,
Ole
> On 1/26/2021 2:29 PM, Ole Holm Nielsen wrote:
>> In another thread, On 26-01-2021 17:44, Prentice Bisbal wrote:
>>> Personally, I think it's good that Slurm RPMs are now available
>>> through EPEL. However, I won't be able to use them, and I'm sure many
>>> people on the list won't either: licensing issues prevent EPEL from
>>> providing support for NVIDIA drivers, so those of us with GPUs on our
>>> clusters will still have to compile Slurm from source to include
>>> NVIDIA GPU support.
>>
>> We're running Slurm 20.02.6 and recently added some NVIDIA GPU nodes.
>> The Slurm GPU documentation seems to be
>> https://slurm.schedmd.com/gres.html
>> We don't seem to have any problems scheduling jobs on GPUs, even
>> though our Slurm RPM build host doesn't have any NVIDIA software
>> installed; the following command returns nothing there:
>> $ ldconfig -p | grep libnvidia-ml
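The same check can be wrapped in a small function that gives an explicit yes/no answer, for instance in a build script run on the RPM build host. It reads the listing on stdin so it can be exercised without a real linker cache; lists_libnvidia_ml is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: succeed if a linker-cache listing on stdin (such as the output of
# `ldconfig -p`) mentions libnvidia-ml, the NVML library.
# lists_libnvidia_ml is a hypothetical helper name.
lists_libnvidia_ml() {
  grep -q 'libnvidia-ml\.so'
}

# Example: check this host's linker cache; errors from ldconfig are ignored.
if ldconfig -p 2>/dev/null | lists_libnvidia_ml; then
  echo "libnvidia-ml found in the linker cache"
else
  echo "libnvidia-ml not found in the linker cache"
fi
```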
>>
>> I'm curious about Prentice's statement about needing NVIDIA libraries
>> to be installed when building Slurm RPMs, and I read the discussion in
>> bug 9525,
>> https://bugs.schedmd.com/show_bug.cgi?id=9525
>> from which it seems that the problem was fixed in 20.02.6 and 20.11.
>>
>> Question: Is there anything special that needs to be done when
>> building Slurm RPMs with NVIDIA GPU support?
>>
>> Thanks,
>> Ole