<div dir="ltr">You all might be interested in a patch to the SPEC file, to not make the slurm RPMs depend on libnvidia-ml.so, even if it's been enabled at configure time. See <a href="https://bugs.schedmd.com/show_bug.cgi?id=7919#c3">https://bugs.schedmd.com/show_bug.cgi?id=7919#c3</a><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Jan 26, 2021 at 3:17 PM Paul Raines <<a href="mailto:raines@nmr.mgh.harvard.edu">raines@nmr.mgh.harvard.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>

On Tue, Jan 26, 2021 at 3:17 PM Paul Raines <raines@nmr.mgh.harvard.edu> wrote:

You should check your jobs that allocated GPUs and make sure
CUDA_VISIBLE_DEVICES is being set in the environment. If it is not,
that is a sign your GPU support is not really there and SLURM is just
doing "generic" resource assignment.

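A quick way to check (the GRES request here is just an example; adjust for
your site):

   srun --gres=gpu:1 env | grep CUDA_VISIBLE_DEVICES

With real GPU support you should see something like CUDA_VISIBLE_DEVICES=0;
with only generic resource assignment the variable is simply absent.
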
I have both GPU and non-GPU nodes, so I build the SLURM RPMs twice: once on
a non-GPU node, using those RPMs to install on the non-GPU nodes, and then
again on a GPU node where CUDA is installed via the NVIDIA CUDA YUM repo
RPMs, so the NVML lib is at /lib64/libnvidia-ml.so.1 (from rpm
nvidia-driver-NVML-455.45.01-1.el8.x86_64) and no special mods to the
default RPM SPEC are needed. I just run

rpmbuild -tb slurm-20.11.3.tar.bz2

You can run 'rpm -qlp slurm-20.11.3-1.el8.x86_64.rpm | grep nvml' and see
that /usr/lib64/slurm/gpu_nvml.so exists only in the RPM built on the
GPU node.
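
(Once that RPM is installed you can also confirm that the plugin resolves
the driver library -- assuming it is dynamically linked against NVML, which
is exactly what creates the RPM dependency mentioned above:

   ldd /usr/lib64/slurm/gpu_nvml.so | grep libnvidia-ml

On a node with the driver installed this should point at
/lib64/libnvidia-ml.so.1.)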

-- Paul Raines (http://help.nmr.mgh.harvard.edu)


On Tue, 26 Jan 2021 2:29pm, Ole Holm Nielsen wrote:

> In another thread, on 26-01-2021 17:44, Prentice Bisbal wrote:
>> Personally, I think it's good that Slurm RPMs are now available through
>> EPEL, although I won't be able to use them, and I'm sure many people on
>> the list won't be able to either: licensing issues prevent EPEL from
>> providing support for NVIDIA drivers, so those of us with GPUs on our
>> clusters will still have to compile Slurm from source to include NVIDIA
>> GPU support.
>
> We're running Slurm 20.02.6 and recently added some NVIDIA GPU nodes.
> The Slurm GPU documentation seems to be
> https://slurm.schedmd.com/gres.html
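> (For context, the minimal GPU setup on that page is a GresTypes=gpu line
> in slurm.conf plus a gres.conf on each GPU node; the names below are only
> an illustration:
>
>    # slurm.conf
>    GresTypes=gpu
>    NodeName=gpunode01 Gres=gpu:4 ...
>
>    # gres.conf on gpunode01 (hand-written form, needs no NVML)
>    Name=gpu File=/dev/nvidia[0-3]
>
> The AutoDetect=nvml alternative in gres.conf is what requires the
> NVML-enabled build being discussed in this thread.)
>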
> We don't seem to have any problems scheduling jobs on GPUs, even though
> our Slurm RPM build host doesn't have any NVIDIA software installed, as
> shown by the command:
> $ ldconfig -p | grep libnvidia-ml
>
> I'm curious about Prentice's statement about needing NVIDIA libraries to
> be installed when building Slurm RPMs, and I read the discussion in bug
> 9525, https://bugs.schedmd.com/show_bug.cgi?id=9525
> from which it seems that the problem was fixed in 20.02.6 and 20.11.
>
> Question: Is there anything special that needs to be done when building
> Slurm RPMs with NVIDIA GPU support?
>
> Thanks,
> Ole