[slurm-users] Building Slurm RPMs with NVIDIA GPU support?
Paul Edmon
pedmon at cfa.harvard.edu
Tue Jan 26 21:04:09 UTC 2021
That is correct. I think NVML has some additional features, but in terms
of actually scheduling GPUs, what you have should work. They will just be
treated as normal gres resources.
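As a concrete sketch of that kind of setup (node name, GPU count, and
device paths below are illustrative, not from any particular site):

    # slurm.conf (same copy on the controller and all nodes)
    GresTypes=gpu
    NodeName=gpu01 Gres=gpu:4 State=UNKNOWN

    # gres.conf on the GPU node; no NVML involved
    Name=gpu File=/dev/nvidia[0-3]

Jobs then request GPUs as ordinary gres, e.g. with --gres=gpu:2.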
-Paul Edmon-
On 1/26/2021 3:55 PM, Ole Holm Nielsen wrote:
> On 26-01-2021 21:36, Paul Edmon wrote:
>> You can include GPUs as gres in Slurm without compiling specifically
>> against NVML. You only really need to do that if you want to use the
>> autodetection features that have been built into Slurm. We don't
>> really use any of those features at our site; we only started building
>> against NVML to future-proof ourselves for when/if those features
>> become relevant to us.
>
> Thanks for this clarification about not actually *requiring* the
> NVIDIA NVML library in the Slurm build!
>
> Now I'm seeing this description in https://slurm.schedmd.com/gres.html
> about automatic GPU configuration by Slurm:
>
>> If AutoDetect=nvml is set in gres.conf, and the NVIDIA Management
>> Library (NVML) is installed on the node and was found during Slurm
>> configuration, configuration details will automatically be filled in
>> for any system-detected NVIDIA GPU. This removes the need to
>> explicitly configure GPUs in gres.conf, though the Gres= line in
>> slurm.conf is still required in order to tell slurmctld how many GRES
>> to expect.
>
> I have defined our GPUs manually in gres.conf with File=/dev/nvidia?
> lines, so it would seem that this obviates the need for NVML. Is this
> the correct conclusion?
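> For comparison, the NVML path would replace those explicit lines with
> autodetection (a sketch, and only meaningful when slurmd was built
> against NVML):
>
>     # gres.conf on the GPU node
>     AutoDetect=nvml
>
> while slurm.conf still carries the count, e.g. NodeName=... Gres=gpu:4.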
>
> /Ole
>
>
>> To me at least it would be nicer if there were a less hacky way of
>> getting it to do that. Arguably, Slurm should dynamically link against
>> the libraries it needs, or not, depending on the node. We hit this
>> issue with Lustre/IB as well, where you have to roll a separate Slurm
>> build for each type of node you have if you want these features, which
>> is hardly ideal.
>>
>> -Paul Edmon-
>>
>> On 1/26/2021 3:24 PM, Robert Kudyba wrote:
>>> You all might be interested in a patch to the SPEC file, to not make
>>> the slurm RPMs depend on libnvidia-ml.so, even if it's been enabled
>>> at configure time. See
>>> https://bugs.schedmd.com/show_bug.cgi?id=7919#c3
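>>> The general RPM technique there (a sketch of one common pattern, not
>>> necessarily the exact patch in that bug) is to filter the
>>> auto-generated dependency in the SPEC file:
>>>
>>>    # keep rpmbuild from emitting
>>>    # Requires: libnvidia-ml.so.1()(64bit)
>>>    %global __requires_exclude ^libnvidia-ml\\.so.*$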
>>>
>>> On Tue, Jan 26, 2021 at 3:17 PM Paul Raines
>>> <raines at nmr.mgh.harvard.edu> wrote:
>>>
>>>
>>> You should check your jobs that allocate GPUs and make sure
>>> CUDA_VISIBLE_DEVICES is being set in the environment. If it is not
>>> set, that is a sign your GPU support is not really there and SLURM is
>>> just doing "generic" resource assignment.
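>>>
>>> A quick way to test this (job parameters are illustrative): if GPU
>>> support is really active, a job like
>>>
>>>    srun --gres=gpu:1 bash -c 'echo $CUDA_VISIBLE_DEVICES'
>>>
>>> should print a device index such as "0"; an empty result suggests
>>> only generic gres accounting is in effect.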
>>>
>>> I have both GPU and non-GPU nodes. I build SLURM RPMs twice: once on
>>> a non-GPU node, and I use those RPMs to install on the non-GPU nodes.
>>> Then I build again on the GPU node, where CUDA is installed via the
>>> NVIDIA CUDA YUM repo rpms, so the NVML lib is at
>>> /lib64/libnvidia-ml.so.1 (from rpm
>>> nvidia-driver-NVML-455.45.01-1.el8.x86_64) and no special mods to the
>>> default RPM SPEC are needed. I just run
>>>
>>> rpmbuild -tb slurm-20.11.3.tar.bz2
>>>
>>> You can run 'rpm -qlp slurm-20.11.3-1.el8.x86_64.rpm | grep nvml'
>>> and see that /usr/lib64/slurm/gpu_nvml.so only exists in the RPM
>>> built on the GPU node.
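>>> A complementary check on an installed node (assuming the default
>>> install prefix from the same build) is simply:
>>>
>>>    ls -l /usr/lib64/slurm/gpu_nvml.so
>>>
>>> which should succeed only where the GPU-enabled build was installed.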
>>>
>>> -- Paul Raines
>>> (http://help.nmr.mgh.harvard.edu)
>>>
>>>
>>>
>>> On Tue, 26 Jan 2021 2:29pm, Ole Holm Nielsen wrote:
>>>
>>> > In another thread, on 26-01-2021 17:44, Prentice Bisbal wrote:
>>> >> Personally, I think it's good that Slurm RPMs are now available
>>> >> through EPEL, although I won't be able to use them, and I'm sure
>>> >> many people on the list won't be able to either, since licensing
>>> >> issues prevent them from providing support for NVIDIA drivers. So
>>> >> those of us with GPUs on our clusters will still have to compile
>>> >> Slurm from source to include NVIDIA GPU support.
>>> >
>>> > We're running Slurm 20.02.6 and recently added some NVIDIA GPU
>>> > nodes.
>>> > The Slurm GPU documentation seems to be
>>> >
>>> > https://slurm.schedmd.com/gres.html
>>>
>>> > We don't seem to have any problems scheduling jobs on GPUs, even
>>> > though our Slurm RPM build host doesn't have any NVIDIA software
>>> > installed, as shown by the command:
>>> > $ ldconfig -p | grep libnvidia-ml
>>> >
>>> > I'm curious about Prentice's statement about needing NVIDIA
>>> > libraries to be installed when building Slurm RPMs, and I read the
>>> > discussion in bug 9525,
>>> >
>>> > https://bugs.schedmd.com/show_bug.cgi?id=9525
>>>
>>> > from which it seems that the problem was fixed in 20.02.6 and
>>> > 20.11.
>>> >
>>> > Question: Is there anything special that needs to be done when
>>> > building Slurm RPMs with NVIDIA GPU support?
>>> >
>>> > Thanks,
>>> > Ole
>