[slurm-users] How to use Autodetect=nvml in gres.conf

Dean Schulze dean.w.schulze at gmail.com
Fri Feb 7 18:49:12 UTC 2020


So this is related to the gpu/nvml plugin in the source code tree.  It
didn't get built because I didn't have the NVIDIA driver (really, the
library libnvidia-ml.so) installed when I built the code.  I can see in
config.log where configure tries to find -lnvidia-ml and skips building
the gpu_nvml plugin if it doesn't find it.

So in order to use Autodetect=nvml in gres.conf you have to install the
NVIDIA driver before building the source code.
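
To sanity-check this before (and after) building, one can confirm the linker
can see the library and that configure actually picked it up. These commands
are illustrative; library paths and the build directory layout vary by
distribution:

```shell
# Check that the NVML library shipped with the NVIDIA driver is visible
# to the dynamic linker
ldconfig -p | grep nvidia-ml

# After running ./configure, confirm whether the -lnvidia-ml check succeeded
grep -i "nvidia-ml" config.log

# If the plugin was built, it should link against libnvidia-ml
ldd lib/slurm/gpu_nvml.so | grep nvidia-ml
```

If the first command prints nothing, the driver library is missing and the
gpu_nvml plugin will be silently skipped at configure time.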

I wish they would document some of these things.
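
For reference, a minimal setup along the lines discussed in this thread,
assuming the driver was present at build time and NVML detects two gp100
cards (the node name and GPU count are illustrative):

```
# gres.conf -- NVML discovers File=, Cores=, and Type= automatically
AutoDetect=nvml

# slurm.conf -- optional explicit entry; the Type string must match what
# NVML reports (lowercase, underscores instead of spaces or dashes)
NodeName=node01 Gres=gpu:gp100:2
```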


On Fri, Feb 7, 2020 at 9:59 AM Stephan Roth <stephan.roth at ee.ethz.ch> wrote:

> gpu_nvml.so links to libnvidia-ml.so:
>
> $ ldd lib/slurm/gpu_nvml.so
>         ...
>         libnvidia-ml.so.1 => /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
> (0x00007f2d2bac8000)
>         ...
>
> When you run configure you'll see something along these lines:
>
>
> On 07.02.20 17:03, dean.w.schulze at gmail.com wrote:
> > I just checked the .deb package that I built from source, and there is
> nothing in it that has nv or cuda in its name.
> >
> > Are you sure that slurm distributes nvidia binaries?
> >
> > -----Original Message-----
> > From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of
> Stephan Roth
> > Sent: Friday, February 7, 2020 2:23 AM
> > To: slurm-users at lists.schedmd.com
> > Subject: Re: [slurm-users] How to use Autodetect=nvml in gres.conf
> >
> > On 05.02.20 21:06, Dean Schulze wrote:
> >   > I need to dynamically configure gpus on my nodes.  The gres.conf doc
> >   > says to use
> >   >
> >   > Autodetect=nvml
> >
> > That's all you need in gres.conf, provided you don't configure any
> > Gres=... entries for your nodes in slurm.conf.
> > If you do, make sure the string matches what NVML discovers, i.e.
> > lowercase and underscores instead of spaces or dashes.
> >
> > The upside of configuring everything explicitly is that you will be
> > informed if the automatically detected GPUs in a node don't match what
> > you configured.
> >
> >   > in gres.conf instead of adding configuration details to each gpu in
> >   > gres.conf.  The docs aren't really clear about this because they
> show an
> >   > example with the details for each gpu:
> >   >
> >   > AutoDetect=nvml
> >   > Name=gpu Type=gp100  File=/dev/nvidia0 Cores=0,1
> >   > Name=gpu Type=gp100  File=/dev/nvidia1 Cores=0,1
> >   > Name=gpu Type=p6000  File=/dev/nvidia2 Cores=2,3
> >   > Name=gpu Type=p6000  File=/dev/nvidia3 Cores=2,3
> >   > Name=mps Count=200  File=/dev/nvidia0
> >   > Name=mps Count=200  File=/dev/nvidia1
> >   > Name=mps Count=100  File=/dev/nvidia2
> >   > Name=mps Count=100  File=/dev/nvidia3
> >   > Name=bandwidth Type=lustre Count=4G
> >   >
> >   > First Question:  If I use Autodetect=nvml do I also need to specify
> >   > File= and Cores= for each gpu in gres.conf?  I'm hoping that with
> >   > Autodetect=nvml that all I need is the Name= and Type= for each gpu.
> >   > Otherwise it's not clear what the purpose of setting Autodetect=nvml
> >   > would be.
> >   >
> >   > Second Question:  I installed the CUDA tools from the binary
> >   > cuda_10.2.89_440.33.01_linux.run.  When I restart slurmd with
> >   > Autodetect=nvml in gres.conf I get this error:
> >   >
> >   > fatal: We were configured to autodetect nvml functionality, but we
> >   > weren't able to find that lib when Slurm was configured.
> >   >
> >   > Is there something else I need to configure to tell slurmd how to use
> > nvml?
> >
> > I guess the version of Slurm you're using was linked against a version
> > of NVML which has been overwritten by your installation of Cuda 10.2.
> >
> > If that's the case there are various ways to solve the problem,
> > depending on your reason for installing Cuda 10.2.
> >
> > My recommendation is to keep the Cuda version matching your system's
> > Slurm package and to install Cuda 10.2 in a non-default location if you
> > need to make it available on a cluster node.
> >
> > If people using your cluster ask for Cuda 10.2, they have the option of
> > creating a conda environment and installing Cuda 10.2 there.
> >
> >
> > Cheers,
> > Stephan
> >
> >
> >
>
>
> -------------------------------------------------------------------
> Stephan Roth | ISG.EE D-ITET ETH Zurich | http://www.isg.ee.ethz.ch
> +4144 632 30 59  |  ETF D 104  |  Sternwartstrasse 7  | 8092 Zurich
> -------------------------------------------------------------------
> GPG Fingerprint: E2B9 1B4F 4D35 F233 BE12  1BE9 B423 4018 FBC0 EA17
>
>