[slurm-users] How to use Autodetect=nvml in gres.conf

Fri Feb 7 16:03:52 UTC 2020

I just checked the .deb package that I built from source and there is nothing in it that has nv or cuda in its name.

Are you sure that slurm distributes nvidia binaries?

-----Original Message-----
From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Stephan Roth
Sent: Friday, February 7, 2020 2:23 AM
To: slurm-users at lists.schedmd.com
Subject: Re: [slurm-users] How to use Autodetect=nvml in gres.conf

On 05.02.20 21:06, Dean Schulze wrote:
 > I need to dynamically configure gpus on my nodes.  The gres.conf doc
 > says to use
 >
 > Autodetect=nvml

That's all you need in gres.conf provided you don't configure any 
Gres=... entries for your nodes in your slurm.conf.
If you do, make sure the string matches what NVML discovers, i.e. 
lowercase and underscores instead of spaces or dashes.
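
As a rough illustration of the naming rule just described, here is a minimal Python sketch. The function name gres_type_from_name and the sample product name are only illustrative and not part of Slurm. It simply applies the rule as stated in this message; the string slurmd actually reports may differ between Slurm versions (for example in how dashes are handled), so treat the result as a starting point and compare it with what slurmd reports on your node.

```python
def gres_type_from_name(product_name, replace_chars=" -"):
    """Lowercase the name and replace each listed character with an underscore.

    Assumption: this mirrors the rule described in this thread, not
    necessarily the exact transformation done inside Slurm.
    """
    name = product_name.strip().lower()
    for ch in replace_chars:
        name = name.replace(ch, "_")
    return name


# Example input: a product name as a GPU might report it.
print(gres_type_from_name("Tesla V100-PCIE-16GB"))  # prints tesla_v100_pcie_16gb
```

If your Slurm version turns out to keep dashes, passing replace_chars=" " gives the variant that only replaces spaces.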

The upside of configuring everything is you will be informed in case the 
automatically detected GPUs in a node don't match what you configured.

 > in gres.conf instead of adding configuration details to each gpu in
 > gres.conf.  The docs aren't really clear about this because they show an
 > example with the details for each gpu:
 >
 > AutoDetect=nvml
 > Name=gpu Type=gp100  File=/dev/nvidia0 Cores=0,1
 > Name=gpu Type=gp100  File=/dev/nvidia1 Cores=0,1
 > Name=gpu Type=p6000  File=/dev/nvidia2 Cores=2,3
 > Name=gpu Type=p6000  File=/dev/nvidia3 Cores=2,3
 > Name=mps Count=200  File=/dev/nvidia0
 > Name=mps Count=200  File=/dev/nvidia1
 > Name=mps Count=100  File=/dev/nvidia2
 > Name=mps Count=100  File=/dev/nvidia3
 > Name=bandwidth Type=lustre Count=4G
 >
 > First Question:  If I use Autodetect=nvml do I also need to specify
 > File= and Cores= for each gpu in gres.conf?  I'm hoping that with
 > Autodetect=nvml that all I need is the Name= and Type= for each gpu.
 > Otherwise it's not clear what the purpose of setting Autodetect=nvml
 > would be.
 >
 > Second Question:  I installed the CUDA tools from the binary
 > cuda_10.2.89_440.33.01_linux.run.  When I restart slurmd with
 > Autodetect=nvml in gres.conf I get this error:
 >
 > fatal: We were configured to autodetect nvml functionality, but we
 > weren't able to find that lib when Slurm was configured.
 >
 > Is there something else I need to configure to tell slurmd how to use
 > nvml?

I guess the version of slurm you're using was linked against a version 
of NVML which has been overwritten by your installation of Cuda 10.2.
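
One way to gather evidence for this guess is to check two things on the node: whether the dynamic linker can find an NVML library at all, and whether the Slurm build contains the NVML GPU plugin (gpu_nvml.so). The Python sketch below does both. The plugin directory is a hypothetical example; take the real one from the PluginDir value of your installation. The helper names are made up for this illustration.

```python
import ctypes.util
from pathlib import Path


def nvml_library_name():
    """Return the NVML library name the dynamic linker can find, or None."""
    return ctypes.util.find_library("nvidia-ml")


def has_nvml_plugin(plugin_dir):
    """Report whether gpu_nvml.so is present in the given Slurm plugin directory."""
    return (Path(plugin_dir) / "gpu_nvml.so").is_file()


if __name__ == "__main__":
    # Hypothetical plugin directory (assumption); use your site's PluginDir.
    plugin_dir = "/usr/lib64/slurm"
    print("NVML library:", nvml_library_name())
    print("gpu_nvml.so present:", has_nvml_plugin(plugin_dir))
```

If the plugin file is missing, the Slurm build most likely did not find NVML at configure time, which would also explain a package containing nothing with "nv" in its name.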

If that's the case there are various ways to solve that problem, but 
that depends on your reason to install Cuda 10.2.

My recommendation is to use the Cuda version of your system matching 
your system's slurm package and to install Cuda 10.2 in a non-default 
location, provided you need to make it available on a cluster node.

If people using your cluster ask for Cuda 10.2 they have the option of 
using a virtual conda environment and installing Cuda 10.2 there.

Cheers,
Stephan