<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <p>You can include GPUs as GRES in Slurm without compiling
      specifically against NVML.  You only really need to do that if you
      want to use the autodetection features that have been built into
      Slurm.  We don't really use any of those features at our site; we
      only started building against NVML to future-proof ourselves for
      when/if those features become relevant to us.</p>
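    <p>For the manual route, a minimal sketch of the two config entries
      involved (the node name and device paths below are only
      illustrative; adjust them for your site):<br>
    </p>
    <p>   # slurm.conf: declare the GRES type and the per-node count<br>
         GresTypes=gpu<br>
         NodeName=gpunode01 Gres=gpu:4 ...   # plus your usual CPU/memory settings<br>
      <br>
         # gres.conf on the GPU node: map the GRES to device files,<br>
         # no AutoDetect=nvml line needed<br>
         Name=gpu File=/dev/nvidia[0-3]<br>
    </p>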
    <p>To me at least, it would be nicer if there were a less hacky way
      of getting it to do that.  Arguably Slurm should dynamically link
      against the libraries it needs, or not, depending on the node.  We
      hit this issue with Lustre/IB as well: if you want those features,
      you have to roll a separate Slurm build for each type of node you
      have, which is hardly ideal.<br>
    </p>
    <p>-Paul Edmon-<br>
    </p>
    <div class="moz-cite-prefix">On 1/26/2021 3:24 PM, Robert Kudyba
      wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CAFHi+KTgdz8QhbAqML9cWbfqRUEFwe1AJZvXq4FEAPFah3LVFA@mail.gmail.com">
      <meta http-equiv="content-type" content="text/html; charset=UTF-8">
      <div dir="ltr">You all might be interested in a patch to the SPEC
        file, to not make the slurm RPMs depend on libnvidia-ml.so, even
        if it's been enabled at configure time. See <a
          href="https://bugs.schedmd.com/show_bug.cgi?id=7919#c3"
          moz-do-not-send="true">https://bugs.schedmd.com/show_bug.cgi?id=7919#c3</a><br>
      </div>
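      <div dir="ltr">One way to express that in the spec file (a sketch,
        not necessarily the exact patch from that bug report) is to
        filter the auto-generated dependency so the built RPMs no longer
        require libnvidia-ml.so at install time:<br>
        <br>
           # near the top of slurm.spec<br>
           %global __requires_exclude ^libnvidia-ml\.so.*$<br>
      </div>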
      <br>
      <div class="gmail_quote">
        <div dir="ltr" class="gmail_attr">On Tue, Jan 26, 2021 at 3:17
          PM Paul Raines <<a href="mailto:raines@nmr.mgh.harvard.edu"
            moz-do-not-send="true">raines@nmr.mgh.harvard.edu</a>>
          wrote:<br>
        </div>
        <blockquote class="gmail_quote" style="margin:0px 0px 0px
          0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>
          You should check your jobs that allocated GPUs and make sure<br>
          CUDA_VISIBLE_DEVICES is being set in the environment.  If it is
          not, that is a sign<br>
          your GPU support is not really there and SLURM is just doing
          "generic"<br>
          resource assignment.<br>
          <br>
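          As a quick check (only a sketch; adjust the GRES request for
          your site), ask for one GPU and print the variable -- it should
          come back set to a device index such as 0:<br>
          <br>
             srun --gres=gpu:1 printenv CUDA_VISIBLE_DEVICES<br>
          <br>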
          I have both GPU and non-GPU nodes.  I build SLURM rpms twice:
          once on a <br>
          non-GPU node, and use those RPMs to install on the non-GPU
          nodes. Then I build <br>
          again on the GPU node, where CUDA is installed via the NVIDIA
          CUDA YUM repo <br>
          rpms so the NVML lib is at /lib64/libnvidia-ml.so.1 (from rpm
          <br>
          nvidia-driver-NVML-455.45.01-1.el8.x86_64), and no special mods
          to the default <br>
          RPM SPEC are needed.  I just run<br>
          <br>
             rpmbuild -tb slurm-20.11.3.tar.bz2<br>
          <br>
          You can run 'rpm -qlp slurm-20.11.3-1.el8.x86_64.rpm | grep
          nvml' and see<br>
          that /usr/lib64/slurm/gpu_nvml.so only exists on the one built
          on the<br>
          GPU node.<br>
          <br>
          -- Paul Raines (<a href="http://help.nmr.mgh.harvard.edu"
            rel="noreferrer" target="_blank">http://help.nmr.mgh.harvard.edu</a>
          )<br>
          <br>
          <br>
          <br>
          <br>
          <br>
          <br>
          On Tue, 26 Jan 2021 2:29pm, Ole Holm Nielsen wrote:<br>
          <br>
          > In another thread, On 26-01-2021 17:44, Prentice Bisbal
          wrote:<br>
          >>  Personally, I think it's good that Slurm RPMs are
          now available through<br>
          >>  EPEL, although I won't be able to use them, and I'm
          sure many people on<br>
          >>  the list won't be able to either, since licensing
          issues prevent them from<br>
          >>  providing support for NVIDIA drivers, so those of us
          with GPUs on our<br>
          >>  clusters will still have to compile Slurm from
          source to include NVIDIA<br>
          >>  GPU support.<br>
          ><br>
          > We're running Slurm 20.02.6 and recently added some
          NVIDIA GPU nodes.<br>
          > The Slurm GPU documentation seems to be<br>
          > <a href="https://slurm.schedmd.com/gres.html"
            rel="noreferrer" target="_blank">https://slurm.schedmd.com/gres.html</a>
          <br>
          > We don't seem to have any problems scheduling jobs on
          GPUs, even though our <br>
          > Slurm RPM build host doesn't have any NVIDIA software
          installed, as shown by <br>
          > the command:<br>
          > $ ldconfig -p | grep libnvidia-ml<br>
          ><br>
          > I'm curious about Prentice's statement about needing
          NVIDIA libraries to be <br>
          > installed when building Slurm RPMs, and I read the
          discussion in bug 9525,<br>
          > <a href="https://bugs.schedmd.com/show_bug.cgi?id=9525"
            rel="noreferrer" target="_blank">https://bugs.schedmd.com/show_bug.cgi?id=9525</a>
          <br>
          > from which it seems that the problem was fixed in 20.02.6
          and 20.11.<br>
          ><br>
          > Question: Is there anything special that needs to be done
          when building Slurm <br>
          > RPMs with NVIDIA GPU support?<br>
          ><br>
          > Thanks,<br>
          > Ole<br>
          ><br>
          ><br>
          ><br>
          <br>
        </blockquote>
      </div>
    </blockquote>
  </body>
</html>