[slurm-users] Building Slurm RPMs with NVIDIA GPU support?

Paul Raines raines at nmr.mgh.harvard.edu
Tue Jan 26 20:11:16 UTC 2021


You should check the jobs that were allocated GPUs and make sure
CUDA_VISIBLE_DEVICES is being set in their environment.  If it is not, that
is a sign your GPU support is not really there and SLURM is just doing
"generic" resource assignment.
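
A minimal sketch of the check described above: run env inside a GPU
allocation and look for the variable.  The helper name
has_cuda_visible_devices is made up for illustration, and the srun line
assumes your GRES is named "gpu"; adjust both to your site.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): reads NAME=value lines, such as
# the output of `env`, on stdin and succeeds only if CUDA_VISIBLE_DEVICES
# is set to a non-empty value.
has_cuda_visible_devices() {
    grep -q '^CUDA_VISIBLE_DEVICES=.'
}

# Usage inside a GPU allocation (assumes a GRES named "gpu"):
#   if srun --gres=gpu:1 env | has_cuda_visible_devices; then
#       echo "CUDA_VISIBLE_DEVICES is set"
#   else
#       echo "CUDA_VISIBLE_DEVICES is NOT set"
#   fi
```

The srun call is left as a comment so the snippet can be sourced on a host
without Slurm.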

I have both GPU and non-GPU nodes, so I build the SLURM RPMs twice: once on a
non-GPU node, for installing on the non-GPU nodes, and again on a GPU node
where CUDA is installed via the NVIDIA CUDA YUM repo RPMs, so the NVML lib is
at /lib64/libnvidia-ml.so.1 (from rpm
nvidia-driver-NVML-455.45.01-1.el8.x86_64).  No special mods to the default
RPM SPEC are needed.  I just run

   rpmbuild --tb slurm-20.11.3.tar.bz2

You can run 'rpm -qlp slurm-20.11.3-1.el8.x86_64.rpm | grep nvml' and see
that /usr/lib64/slurm/gpu_nvml.so exists only in the RPM built on the
GPU node.
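
If you want to script that comparison, something like the following should
work.  It is only a sketch: the helper name lists_gpu_nvml and the two
directory names in the usage comment are placeholders, not anything Slurm
or rpm provides.

```shell
#!/bin/sh
# Hypothetical helper: reads a package file listing (e.g. the output of
# `rpm -qlp`) on stdin and succeeds only if the NVML GPU plugin is listed.
lists_gpu_nvml() {
    grep -q '/slurm/gpu_nvml\.so$'
}

# Usage, comparing the two builds (directory names are placeholders):
#   for rpm in gpu-build/slurm-20.11.3-1.el8.x86_64.rpm \
#              nogpu-build/slurm-20.11.3-1.el8.x86_64.rpm; do
#       if rpm -qlp "$rpm" | lists_gpu_nvml; then
#           echo "$rpm: has gpu_nvml.so"
#       else
#           echo "$rpm: no gpu_nvml.so"
#       fi
#   done
```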

-- Paul Raines (http://help.nmr.mgh.harvard.edu)



On Tue, 26 Jan 2021 2:29pm, Ole Holm Nielsen wrote:

> In another thread, On 26-01-2021 17:44, Prentice Bisbal wrote:
>>  Personally, I think it's good that Slurm RPMs are now available through
>>  EPEL, although I won't be able to use them, and I'm sure many people on
>>  the list won't be able to either, since licensing issues prevent them from
>>  providing support for NVIDIA drivers, so those of us with GPUs on our
>>  clusters will still have to compile Slurm from source to include NVIDIA
>>  GPU support.
>
> We're running Slurm 20.02.6 and recently added some NVIDIA GPU nodes.
> The Slurm GPU documentation seems to be
> https://slurm.schedmd.com/gres.html
> We don't seem to have any problems scheduling jobs on GPUs, even though our 
> Slurm RPM build host doesn't have any NVIDIA software installed, as shown by 
> the command:
> $ ldconfig -p | grep libnvidia-ml
>
> I'm curious about Prentice's statement about needing NVIDIA libraries to be 
> installed when building Slurm RPMs, and I read the discussion in bug 9525,
> https://bugs.schedmd.com/show_bug.cgi?id=9525
> from which it seems that the problem was fixed in 20.02.6 and 20.11.
>
> Question: Is there anything special that needs to be done when building Slurm 
> RPMs with NVIDIA GPU support?
>
> Thanks,
> Ole
>
>
>


