[slurm-users] Building Slurm RPMs with NVIDIA GPU support?

Paul Edmon pedmon at cfa.harvard.edu
Tue Jan 26 20:36:27 UTC 2021


You can include GPUs as GRES in Slurm without compiling specifically 
against NVML.  You only really need to do that if you want to use the 
autodetection features that have been built into Slurm.  We don't 
really use any of those features at our site; we only started building 
against NVML to future-proof ourselves for when/if those features 
become relevant to us.
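
For reference, this is roughly what a hand-maintained (non-NVML) GRES
setup looks like.  The node name, GPU type, and device paths below are
illustrative, not from our site:

   # gres.conf on the GPU node -- devices listed explicitly,
   # no NVML autodetection required
   Name=gpu Type=v100 File=/dev/nvidia0
   Name=gpu Type=v100 File=/dev/nvidia1

   # slurm.conf
   GresTypes=gpu
   NodeName=gpunode01 Gres=gpu:v100:2

With a build compiled against NVML you could instead put
"AutoDetect=nvml" in gres.conf and drop the explicit Name=/File= lines.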

To me at least it would be nicer if there were a less hacky way of 
getting it to do that.  Arguably Slurm should dynamically link against 
the libraries it needs, or not, depending on the node.  We hit this 
issue with Lustre/IB as well, where you have to roll a separate Slurm 
build for each type of node you have if you want these features, which 
is hardly ideal.

-Paul Edmon-

On 1/26/2021 3:24 PM, Robert Kudyba wrote:
> You all might be interested in a patch to the SPEC file that stops the 
> Slurm RPMs from depending on libnvidia-ml.so even when it's been enabled 
> at configure time. See https://bugs.schedmd.com/show_bug.cgi?id=7919#c3
>
> On Tue, Jan 26, 2021 at 3:17 PM Paul Raines 
> <raines at nmr.mgh.harvard.edu> wrote:
>
>
>     You should check your jobs that allocated GPUs and make sure
>     CUDA_VISIBLE_DEVICES is being set in the environment.  If it is
>     not, that is a sign your GPU support is not really there and SLURM
>     is just doing "generic" resource assignment.
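>
>     As a sketch (the partition and GRES names here are made up;
>     adjust for your site):
>
>        srun -p gpu --gres=gpu:1 env | grep CUDA_VISIBLE_DEVICES
>
>     With real GPU support you should see CUDA_VISIBLE_DEVICES=0 (or
>     similar) in the output; if nothing is printed, only generic
>     resource accounting is happening.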
>
>     I have both GPU and non-GPU nodes, so I build the SLURM RPMs
>     twice.  Once on a non-GPU node, and those RPMs get installed on
>     the non-GPU nodes.  Then again on the GPU node, where CUDA is
>     installed via the NVIDIA CUDA YUM repo rpms so the NVML lib is at
>     /lib64/libnvidia-ml.so.1 (from rpm
>     nvidia-driver-NVML-455.45.01-1.el8.x86_64), and no special mods
>     to the default RPM SPEC are needed.  I just run
>
>        rpmbuild -tb slurm-20.11.3.tar.bz2
>
>     You can run 'rpm -qlp slurm-20.11.3-1.el8.x86_64.rpm | grep nvml'
>     and see that /usr/lib64/slurm/gpu_nvml.so only exists in the one
>     built on the GPU node.
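>
>     To double-check the linkage on the GPU build (a sketch; the path
>     is from the default RPM layout):
>
>        ldd /usr/lib64/slurm/gpu_nvml.so | grep libnvidia-ml
>
>     which should resolve to the driver's libnvidia-ml.so.1 on a node
>     with the NVIDIA driver installed.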
>
>     -- Paul Raines
>     (http://help.nmr.mgh.harvard.edu)
>
>
>
>     On Tue, 26 Jan 2021 2:29pm, Ole Holm Nielsen wrote:
>
>     > In another thread, On 26-01-2021 17:44, Prentice Bisbal wrote:
>     >> Personally, I think it's good that Slurm RPMs are now available
>     >> through EPEL, although I won't be able to use them, and I'm sure
>     >> many people on the list won't be able to either, since licensing
>     >> issues prevent them from providing support for NVIDIA drivers,
>     >> so those of us with GPUs on our clusters will still have to
>     >> compile Slurm from source to include NVIDIA GPU support.
>     >
>     > We're running Slurm 20.02.6 and recently added some NVIDIA GPU
>     > nodes.  The Slurm GPU documentation seems to be
>     > https://slurm.schedmd.com/gres.html
>
>     > We don't seem to have any problems scheduling jobs on GPUs, even
>     > though our Slurm RPM build host doesn't have any NVIDIA software
>     > installed, as shown by the command:
>     > $ ldconfig -p | grep libnvidia-ml
>     >
>     > I'm curious about Prentice's statement about needing NVIDIA
>     > libraries to be installed when building Slurm RPMs, and I read
>     > the discussion in bug 9525,
>     > https://bugs.schedmd.com/show_bug.cgi?id=9525
>
>     > from which it seems that the problem was fixed in 20.02.6 and 20.11.
>     >
>     > Question: Is there anything special that needs to be done when
>     > building Slurm RPMs with NVIDIA GPU support?
>     >
>     > Thanks,
>     > Ole