<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <p>I also compiled Slurm 20.11.8 to get GPU support on AlmaLinux
      8.4, but I don't have any problem with NVML detecting our A100s.</p>
    <p>Maybe the NVML library version used when compiling Slurm has to
      match the library version on the compute node where the GPU is?</p>
    <p>Also, I see that you're using GeForce GTX cards. Could it be
      that NVML only supports Tesla GPUs?</p>
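    <p>A quick way to check both ideas is to see which libnvidia-ml
      your rebuilt slurmd actually links against and what the driver
      reports for each card. The plugin path below is only a guess for
      Debian's slurm-wlm packages, adjust it to wherever your build
      installs the Slurm plugins:</p>
    <pre>ldd /usr/lib/x86_64-linux-gnu/slurm-wlm/gpu_nvml.so | grep -i nvidia-ml
nvidia-smi --query-gpu=index,name,driver_version --format=csv
nvidia-smi -q | grep -i "product brand"</pre>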
    <p>This is my relevant Slurm configuration:</p>
    <pre>slurm.conf:</pre>
    <pre>GresTypes=gpu,mps</pre>
    <pre>NodeName=hpc-gpu[3-4].... Gres=gpu:A100:1</pre>
    <pre>gres.conf:</pre>
    <pre>NodeName=hpc-gpu[1-4] AutoDetect=nvml</pre>
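    <p>I think that even with AutoDetect=nvml, the NodeName line in
      slurm.conf still needs a Gres= entry whose type and count match
      what NVML detects (geforce_gtx_1080 in your log), otherwise
      slurmd reports the cards as ignored. If autodetection keeps
      failing, declaring them explicitly in gres.conf could be a
      fallback; this is only a sketch with a made-up node name:</p>
    <pre># gres.conf (hypothetical node name, adjust to your site)
NodeName=gpu-node01 Name=gpu Type=geforce_gtx_1080 File=/dev/nvidia[0-3]

# slurm.conf, on the matching NodeName line
NodeName=gpu-node01 ... Gres=gpu:geforce_gtx_1080:4</pre>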
    <p><br>
    </p>
    <p>and the NVIDIA part:</p>
    <pre>NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5
</pre>
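    <p>The NVML runtime ships with the driver, so on the compute node
      you can cross-check the library against the driver version; the
      library path below assumes a standard 64-bit driver install on
      Debian, adjust as needed:</p>
    <pre>nvidia-smi --query-gpu=driver_version --format=csv,noheader
ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so*</pre>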
    <div class="moz-cite-prefix"><br>
    </div>
    <div class="moz-cite-prefix">and this is what I see in the log:</div>
    <div class="moz-cite-prefix"><br>
    </div>
    <div class="moz-cite-prefix">
      <pre>[2021-12-01T09:29:45.675] debug:  CPUs:64 Boards:1 Sockets:2 CoresPerSocket:32 ThreadsPerCore:1</pre>
      <pre>[2021-12-01T09:29:45.675] debug:  gres/gpu: init: loaded</pre>
      <pre>[2021-12-01T09:29:45.675] debug:  gres/mps: init: loaded</pre>
      <pre>[2021-12-01T09:29:45.676] debug:  gpu/nvml: init: init: GPU NVML plugin loaded</pre>
      <pre>[2021-12-01T09:29:46.298] debug2: gpu/nvml: _nvml_init: Successfully initialized NVML</pre>
      <pre>[2021-12-01T09:29:46.298] debug:  gpu/nvml: _get_system_gpu_list_nvml: Systems Graphics Driver Version: 495.29.05</pre>
      <pre>[2021-12-01T09:29:46.298] debug:  gpu/nvml: _get_system_gpu_list_nvml: NVML Library Version: 11.495.29.05</pre>
      <pre>[2021-12-01T09:29:46.298] debug2: gpu/nvml: _get_system_gpu_list_nvml: Total CPU count: 64</pre>
      <pre>[2021-12-01T09:29:46.298] debug2: gpu/nvml: _get_system_gpu_list_nvml: Device count: 1</pre>
      <pre>[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml: GPU index 0:</pre>
      <pre>[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Name: nvidia_a100-pcie-40gb</pre>
      <pre>[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:     UUID: GPU-4cbb41e9-296b-ba72-d345-aa41fd7a8842</pre>
      <pre>[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:     PCI Domain/Bus/Device: 0:33:0</pre>
      <pre>[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:     PCI Bus ID: 00000000:21:00.0</pre>
      <pre>[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:     NVLinks: -1</pre>
      <pre>[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Device File (minor number): /dev/nvidia0</pre>
      <pre>[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:     CPU Affinity Range - Machine: 16-23</pre>
      <pre>[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Core Affinity Range - Abstract: 16-23</pre>
      <pre>[2021-12-01T09:29:46.365] debug2: Possible GPU Memory Frequencies (1):</pre>
      <pre>[2021-12-01T09:29:46.365] debug2: -------------------------------</pre>
      <pre>[2021-12-01T09:29:46.365] debug2:     *1215 MHz [0]</pre>
      <pre>[2021-12-01T09:29:46.365] debug2:         Possible GPU Graphics Frequencies (81):</pre>
      <pre>[2021-12-01T09:29:46.365] debug2:         ---------------------------------</pre>
      <pre>[2021-12-01T09:29:46.365] debug2:           *1410 MHz [0]</pre>
      <pre>[2021-12-01T09:29:46.365] debug2:           *1395 MHz [1]</pre>
      <pre>[2021-12-01T09:29:46.365] debug2:           ...</pre>
      <pre>[2021-12-01T09:29:46.365] debug2:           *810 MHz [40]</pre>
      <pre>[2021-12-01T09:29:46.365] debug2:           ...</pre>
      <pre>[2021-12-01T09:29:46.365] debug2:           *225 MHz [79]</pre>
      <pre>[2021-12-01T09:29:46.365] debug2:           *210 MHz [80]</pre>
      <pre>[2021-12-01T09:29:46.555] debug2: gpu/nvml: _nvml_shutdown: Successfully shut down NVML</pre>
      <pre>[2021-12-01T09:29:46.555] gpu/nvml: _get_system_gpu_list_nvml: 1 GPU system device(s) detected</pre>
      <pre>[2021-12-01T09:29:46.555] debug:  Gres GPU plugin: Normalizing gres.conf with system GPUs</pre>
      <pre>[2021-12-01T09:29:46.555] debug2: gres/gpu: _normalize_gres_conf: gres_list_conf:</pre>
      <pre>[2021-12-01T09:29:46.555] debug2:     GRES[gpu] Type:A100 Count:1 Cores(64):(null)  Links:(null) Flags:HAS_TYPE File:(null)</pre>
      <pre>[2021-12-01T09:29:46.556] debug:  gres/gpu: _normalize_gres_conf: Including the following GPU matched between system and configuration:</pre>
      <pre>[2021-12-01T09:29:46.556] debug:      GRES[gpu] Type:A100 Count:1 Cores(64):16-23  Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0</pre>
      <pre>[2021-12-01T09:29:46.556] debug2: gres/gpu: _normalize_gres_conf: gres_list_gpu</pre>
      <pre>[2021-12-01T09:29:46.556] debug2:     GRES[gpu] Type:A100 Count:1 Cores(64):16-23  Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0</pre>
      <pre>[2021-12-01T09:29:46.556] debug:  Gres GPU plugin: Final normalized gres.conf list:</pre>
      <pre>[2021-12-01T09:29:46.556] debug:      GRES[gpu] Type:A100 Count:1 Cores(64):16-23  Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0</pre>
      <pre>[2021-12-01T09:29:46.556] debug:  Gres MPS plugin: Initalized gres.conf list:</pre>
      <pre>[2021-12-01T09:29:46.556] debug:      GRES[gpu] Type:A100 Count:1 Cores(64):16-23  Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0</pre>
      <pre>[2021-12-01T09:29:46.556] debug:  Gres MPS plugin: Final gres.conf list:</pre>
      <pre>[2021-12-01T09:29:46.556] debug:      GRES[gpu] Type:A100 Count:1 Cores(64):16-23  Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0</pre>
      <pre>[2021-12-01T09:29:46.556] Gres Name=gpu Type=A100 Count=1</pre>
      <br>
    </div>
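    <p>In case it helps to compare against your nodes, that level of
      detail comes from raising slurmd verbosity; a quick sketch of the
      two usual ways to get it:</p>
    <pre># slurm.conf (then restart slurmd on the GPU node)
SlurmdDebug=debug2

# or run slurmd in the foreground for a one-off test
slurmd -D -vv</pre>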
    <div class="moz-cite-prefix"><br>
    </div>
    <div class="moz-cite-prefix"><br>
    </div>
    <div class="moz-cite-prefix">Hope it helps.<br>
    </div>
    <div class="moz-cite-prefix"><br>
    </div>
    <div class="moz-cite-prefix"><br>
    </div>
    <div class="moz-cite-prefix">El 30/11/21 a las 16:12, Benjamin Nacar
      escribió:<br>
    </div>
    <blockquote type="cite" cite="mid:20211130101242.9608c47bd9ba2166a278463f@brown.edu">
      <pre class="moz-quote-pre" wrap="">Hi,

We're trying to use Slurm's built-in NVIDIA GPU detection mechanism to avoid having to specify GPUs explicitly in slurm.conf and gres.conf. We're running Debian 11, and the version of Slurm available for Debian 11 is 20.11. However, the version of Slurm in the standard Debian repositories was apparently not compiled on a system with the necessary NVIDIA library installed, so we recompiled Slurm 20.11 from the Debian source package with no modifications.

With AutoDetect=nvml in gres.conf and GresTypes=gpu in slurm.conf, this is what we see on a 4-GPU host after restarting slurmd:

[2021-11-29T15:49:58.226] Node reconfigured socket/core boundaries SocketsPerBoard=12:2(hw) CoresPerSocket=1:6(hw)
[2021-11-29T15:50:02.397] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
[2021-11-29T15:50:02.398] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
[2021-11-29T15:50:02.398] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
[2021-11-29T15:50:02.399] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
[2021-11-29T15:50:02.551] gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system device(s) detected
[2021-11-29T15:50:02.551] gres/gpu: _normalize_gres_conf: WARNING: The following autodetected GPUs are being ignored:
[2021-11-29T15:50:02.551]     GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(12):0-11  Links:0,0,0,-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia3
[2021-11-29T15:50:02.551]     GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(12):0-11  Links:0,0,-1,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia2
[2021-11-29T15:50:02.551]     GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(12):0-11  Links:0,-1,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia1
[2021-11-29T15:50:02.551]     GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(12):0-11  Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-11-29T15:50:02.614] slurmd version 20.11.4 started
[2021-11-29T15:50:02.630] slurmd started on Mon, 29 Nov 2021 15:50:02 -0500
[2021-11-29T15:50:02.630] CPUs=12 Boards=1 Sockets=2 Cores=6 Threads=1 Memory=257840 TmpDisk=3951 Uptime=975072 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

Doing an "scontrol show node" for this host displays "Gres=(null)", and any attempts to submit a job with --gpus=1 results in "srun: error: Unable to allocate resources: Requested node configuration is not available".

Any idea what might be wrong?

Thanks,
~~ bnacar

</pre>
    </blockquote>
    <div class="moz-signature">-- <br>
      <table style="font-size:9pt;color:#444444;" cellspacing="0" cellpadding="0">
        <tbody>
          <tr>
            <td rowspan="4" style="padding-right:5px;"><a style="color:#444444;text-decoration:none;" href="http://citius.usc.es/"><span style="color:#444444"><img src="cid:part1.15B49817.84A17CFA@usc.es" alt="CiTIUS"></span></a></td>
            <td><a style="color:#222222;text-decoration:none;" href="http://citius.usc.es/v/fernando.guillen"><span style="color:#222222">Fernando Guillén Camba</span></a></td>
          </tr>
          <tr>
            <td><span style="color:#222222">Unidade de Xestión de
                Infraestruturas TIC</span></td>
          </tr>
          <tr>
            <td><a style="color:#444444;text-decoration:none;" href="mailto:fernando.guillen@usc.es"><img src="cid:part4.82FE0CDC.353F7E35@usc.es" alt="E-mail:" width="11" height="11"><span style="color:#222222">
                  fernando.guillen@usc.es</span></a> · <img src="cid:part6.6827556D.8BF2964E@usc.es" alt="Phone:" width="11" height="11"><span style="color:#222222"> +34
                881816409</span></td>
          </tr>
          <tr>
            <td><a style="color:#444444;text-decoration:none;" href="http://citius.usc.es"><span style="color:#444444;text-decoration:none;"><img src="cid:part7.F0675F8C.3D8D04EE@usc.es" alt="Website:" width="11" height="11"> citius.usc.es</span></a>
              · <a style="color:#444444;text-decoration:none;" href="http://twitter.com/citiususc"><span style="color:#444444;text-decoration:none;"><img src="cid:part9.6C45802B.5A419E41@usc.es" alt="Twitter:" width="11" height="11"> citiususc</span></a></td>
          </tr>
        </tbody>
      </table>
    </div>
  </body>
</html>