<div dir="ltr"><div dir="ltr"><br></div><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Apr 8, 2020 at 9:34 AM <<a href="mailto:dean.w.schulze@gmail.com" target="_blank">dean.w.schulze@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">I believe in order to compile for nvml you'll have to compile on a system with an Nvidia gpu installed otherwise the Nvidia driver and libraries won't install on that system.<br></blockquote><div> </div><div>Yes our 3 compute nodes have 1 V100 each. So I can run:</div><div>ssh node001<br>Last login: Tue Apr 7 17:30:16 2020 <br># module load shared<br># module load nccl2-cuda10.1-gcc/2.5.6<br>Loading nccl2-cuda10.1-gcc/2.5.6<br> Loading requirement: gcc5/5.5.0 cuda10.1/toolkit/10.1.243<br></div><div><font face="monospace">nvidia-smi<br>Wed Apr 8 10:00:49 2020<br>+-----------------------------------------------------------------------------+<br>| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |<br>|-------------------------------+----------------------+----------------------+<br>| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |<br>| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |<br>|===============================+======================+======================|<br>| 0 Tesla V100-PCIE... On | 00000000:3B:00.0 Off | 0 |<br>| N/A 28C P0 25W / 250W | 0MiB / 32510MiB | 0% E. 
Process |<br>+-------------------------------+----------------------+----------------------+<br><br>+-----------------------------------------------------------------------------+<br>| Processes: GPU Memory |<br>| GPU PID Type Process name Usage |<br>|=============================================================================|<br>| No running processes found |<br>+-----------------------------------------------------------------------------+</font><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
From: slurm-users <<a href="mailto:slurm-users-bounces@lists.schedmd.com" target="_blank">slurm-users-bounces@lists.schedmd.com</a>> On Behalf Of Christopher Samuel<br>
> How can I get this to work by loading the correct Bright module?<br>
<br>
You can't - you will need to recompile Slurm.<br>
<br>
The error says:<br>
<br>
Apr 07 16:52:33 node001 slurmd[299181]: fatal: We were configured to autodetect nvml functionality, but we weren't able to find that lib when Slurm was configured.<br>
<br>
So when Slurm was built the libraries you are telling it to use now were not detected and so the configure script disabled that functionality as it would not otherwise have been able to compile.<br></blockquote><div><br></div><div>But it's clearly there, as noted in my previous reply. From <a href="https://slurm.schedmd.com/gres.html#MPS_Management">https://slurm.schedmd.com/gres.html#MPS_Management</a>:</div><div><br></div>"If AutoDetect=nvml is set in gres.conf, and the NVIDIA Management Library (NVML) is installed on the node and was found during Slurm configuration, configuration details will automatically be filled in for any system-detected NVIDIA GPU. This removes the need to explicitly configure GPUs in gres.conf, though the Gres= line in slurm.conf is still required in order to tell slurmctld how many GRES to expect."</div><div class="gmail_quote"><br></div><div class="gmail_quote">So is there no way to have the configuration details automatically "filled in for any system-detected NVIDIA GPU" without recompiling?</div><div class="gmail_quote"><br></div><div class="gmail_quote">The page also says this:<br>"By default, all system-detected devices are added to the node. However, if Type and File in gres.conf match a GPU on the system, any other properties explicitly specified (e.g. Cores or Links) can be double-checked against it. If the system-detected GPU differs from its matching GPU configuration, then the GPU is omitted from the node with an error. This allows gres.conf to serve as an optional sanity check and notifies administrators of any unexpected changes in GPU properties."</div><div class="gmail_quote"><br></div><div class="gmail_quote">How does the detection of "system-detected devices" work here? And how can I get "Type and File in gres.conf [to] match a GPU on the system"?</div></div>
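<div dir="ltr"><div>For illustration only, here is a minimal sketch of the explicit configuration the quoted documentation passage seems to describe. The node name and the single GPU per node come from this thread. The Type string (v100) and the device path (/dev/nvidia0) are assumptions that would need to be checked against what slurmd actually reports on the node. The AutoDetect line only helps if Slurm was built with NVML support, so on a build without it that line would presumably have to be dropped, leaving just the explicit GPU line.</div><div><font face="monospace"># gres.conf (sketch; Type and File values are assumptions)<br>AutoDetect=nvml<br>NodeName=node001 Name=gpu Type=v100 File=/dev/nvidia0<br><br># slurm.conf (fragment; other node parameters omitted)<br>GresTypes=gpu<br>NodeName=node001 Gres=gpu:v100:1</font></div></div>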