[slurm-users] How to use Autodetect=nvml in gres.conf
Stephan Roth
stephan.roth at ee.ethz.ch
Fri Feb 7 16:57:34 UTC 2020
gpu_nvml.so links to libnvidia-ml.so:
$ ldd lib/slurm/gpu_nvml.so
...
libnvidia-ml.so.1 => /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 (0x00007f2d2bac8000)
...
When you run configure you'll see something along these lines:
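(Illustrative, not verbatim: the exact lines depend on your Slurm
version, but the NVML header and library checks follow the usual
autoconf pattern.)

checking for nvml.h... yes
checking for nvmlInit in -lnvidia-ml... yes
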
On 07.02.20 17:03, dean.w.schulze at gmail.com wrote:
> I just checked the .deb package that I built from source and there is nothing in it that has nv or cuda in its name.
>
> Are you sure that Slurm distributes NVIDIA binaries?
>
> -----Original Message-----
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Stephan Roth
> Sent: Friday, February 7, 2020 2:23 AM
> To: slurm-users at lists.schedmd.com
> Subject: Re: [slurm-users] How to use Autodetect=nvml in gres.conf
>
> On 05.02.20 21:06, Dean Schulze wrote:
> > I need to dynamically configure GPUs on my nodes. The gres.conf doc
> > says to use
> >
> > Autodetect=nvml
>
> That's all you need in gres.conf provided you don't configure any
> Gres=... entries for your nodes in your slurm.conf.
> If you do, make sure the string matches what NVML discovers, i.e.
> lowercase and underscores instead of spaces or dashes.
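>
> For example, assuming NVML reports a GPU named "Tesla V100-PCIE-32GB"
> (a hypothetical model, just for illustration), the sanitized type is
> lowercase with underscores, so the slurm.conf entry would look like:
>
> NodeName=node01 Gres=gpu:tesla_v100_pcie_32gb:4
>
> while gres.conf itself only needs:
>
> AutoDetect=nvml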
>
> The upside of configuring everything is that you will be informed in
> case the automatically detected GPUs in a node don't match what you
> configured.
>
> > in gres.conf instead of adding configuration details for each GPU in
> > gres.conf. The docs aren't really clear about this because they show an
> > example with the details for each GPU:
> >
> > AutoDetect=nvml
> > Name=gpu Type=gp100 File=/dev/nvidia0 Cores=0,1
> > Name=gpu Type=gp100 File=/dev/nvidia1 Cores=0,1
> > Name=gpu Type=p6000 File=/dev/nvidia2 Cores=2,3
> > Name=gpu Type=p6000 File=/dev/nvidia3 Cores=2,3
> > Name=mps Count=200 File=/dev/nvidia0
> > Name=mps Count=200 File=/dev/nvidia1
> > Name=mps Count=100 File=/dev/nvidia2
> > Name=mps Count=100 File=/dev/nvidia3
> > Name=bandwidth Type=lustre Count=4G
> >
> > First Question: If I use Autodetect=nvml do I also need to specify
> > File= and Cores= for each GPU in gres.conf? I'm hoping that with
> > Autodetect=nvml all I need is the Name= and Type= for each GPU.
> > Otherwise it's not clear what the purpose of setting Autodetect=nvml
> > would be.
> >
> > Second Question: I installed the CUDA tools from the binary
> > cuda_10.2.89_440.33.01_linux.run. When I restart slurmd with
> > Autodetect=nvml in gres.conf I get this error:
> >
> > fatal: We were configured to autodetect nvml functionality, but we
> > weren't able to find that lib when Slurm was configured.
> >
> > Is there something else I need to configure to tell slurmd how to use
> > nvml?
>
> I guess the version of Slurm you're using was linked against a version
> of NVML which has been overwritten by your installation of CUDA 10.2.
>
> If that's the case there are various ways to solve the problem, but the
> right one depends on your reason for installing CUDA 10.2.
>
> My recommendation is to keep the CUDA version matching your system's
> Slurm package and to install CUDA 10.2 in a non-default location,
> provided you need to make it available on a cluster node.
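>
> For example, the .run installer can be pointed at a custom prefix
> (sketch only; check the flags with --help for your installer version):
>
> sh cuda_10.2.89_440.33.01_linux.run --silent --toolkit \
>     --toolkitpath=/opt/cuda-10.2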
>
> If people using your cluster ask for CUDA 10.2, they have the option of
> creating a conda environment and installing CUDA 10.2 there.
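>
> A minimal sketch, assuming the cudatoolkit 10.2 package is available
> on the conda channels you use:
>
> conda create -n cuda-10.2 cudatoolkit=10.2
> conda activate cuda-10.2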
>
>
> Cheers,
> Stephan
>
>
>
-------------------------------------------------------------------
Stephan Roth | ISG.EE D-ITET ETH Zurich | http://www.isg.ee.ethz.ch
+4144 632 30 59 | ETF D 104 | Sternwartstrasse 7 | 8092 Zurich
-------------------------------------------------------------------
GPG Fingerprint: E2B9 1B4F 4D35 F233 BE12 1BE9 B423 4018 FBC0 EA17