<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <div class="moz-cite-prefix">Hi Jim,<br>
      <br>
      I don't know if it makes a difference, but I only ever use the
      complete numeric suffix within brackets, as in<br>
      <pre>sjc01enadsapp[01-08]</pre>
      Otherwise I'd raise the debug level of slurmd to maximum by
      setting<br>
      <pre>SlurmdDebug=debug5</pre>
      in <i>slurm.conf</i>, then tail the <i>SlurmdLogFile</i> on a GPU
      node while restarting <i>slurmd</i> there.<br>
      This might shed some light on what goes wrong.<br>
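      <br>
      As a rough sketch only, this is how I would write the lines from
      your mail with the full numeric suffix and the debug setting. I
      have not tested it on your cluster, and everything other than the
      node range and <i>SlurmdDebug</i> is copied from your message:<br>
      <pre># slurm.conf
GresTypes=gpu
SlurmdDebug=debug5
NodeName=sjc01enadsapp[01-08] RealMemory=2063731 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 Gres=gpu:4 State=UNKNOWN

# gres.conf
NodeName=sjc01enadsapp[01-08] Name=gpu File=/dev/nvidia[0-3]</pre>
      Then on one GPU node, assuming <i>slurmd</i> runs as a systemd
      unit named <i>slurmd</i> and <i>SlurmdLogFile</i> points to a
      regular file (both are assumptions on my part, and the log path
      below is only a placeholder):<br>
      <pre># look up the configured log path, then follow it in one terminal
scontrol show config | grep -i SlurmdLogFile
tail -f /path/from/SlurmdLogFile

# in a second terminal on the same node
ls -l /dev/nvidia[0-3]
systemctl restart slurmd</pre>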
      <br>
      Cheers,<br>
      Stephan</div>
    <div class="moz-cite-prefix"><br>
    </div>
    <div class="moz-cite-prefix">On 03.05.22 20:51, Jim Kavitsky wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:PH0PR18MB49224AB867E719611201C539B4C09@PH0PR18MB4922.namprd18.prod.outlook.com">
      <style>@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}@font-face
        {font-family:Menlo;
        panose-1:2 11 6 9 3 8 4 2 2 4;}p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in;
        font-size:10.0pt;
        font-family:"Calibri",sans-serif;}p.p1, li.p1, div.p1
        {mso-style-name:p1;
        margin:0in;
        font-size:10.5pt;
        font-family:Menlo;
        color:black;}span.s1
        {mso-style-name:s1;}span.s2
        {mso-style-name:s2;
        color:#B42419;}span.apple-converted-space
        {mso-style-name:apple-converted-space;}.MsoChpDefault
        {mso-style-type:export-only;
        font-size:10.0pt;}div.WordSection1
        {page:WordSection1;}</style>
      <style type="text/css">.style1 {font-family: "Times New Roman";}</style>
      <div class="WordSection1">
        <p class="MsoNormal"><span style="font-size:11.0pt">Whoops. Sent
            the first to an incorrect address. Apologies if this shows
            up as a duplicate.<o:p></o:p></span></p>
        <p class="MsoNormal"><span style="font-size:11.0pt">-jimk<o:p></o:p></span></p>
        <p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
        <div style="border:none;border-top:solid #B5C4DF
          1.0pt;padding:3.0pt 0in 0in 0in">
          <p class="MsoNormal" style="margin-bottom:12.0pt"><b><span
                style="font-size:12.0pt;color:black">From:
              </span></b><span style="font-size:12.0pt;color:black">Jim
              Kavitsky <a class="moz-txt-link-rfc2396E" href="mailto:JimKavitsky@lucidmotors.com">&lt;JimKavitsky@lucidmotors.com&gt;</a><br>
              <b>Date: </b>Tuesday, May 3, 2022 at 11:46 AM<br>
              <b>To: </b><a class="moz-txt-link-abbreviated" href="mailto:slurm-users@schedmd.com">slurm-users@schedmd.com</a>
              <a class="moz-txt-link-rfc2396E" href="mailto:slurm-users@schedmd.com">&lt;slurm-users@schedmd.com&gt;</a><br>
              <b>Subject: </b>gres/gpu count lower than reported<o:p></o:p></span></p>
        </div>
        <p class="MsoNormal"><span style="font-size:10.5pt">Hello Fellow
            Slurm Admins,</span><span style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span style="font-size:10.5pt"> </span><span
            style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="p1">I have a new Slurm installation that was working
          and running basic test jobs until I added GPU support. My
          worker nodes are now all in the drain state, with the reason
          <span class="s1">gres/gpu count reported lower than configured
            (0 &lt; 4)</span>.<o:p></o:p></p>
        <p class="MsoNormal"><span style="font-size:10.5pt"> </span><span
            style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span style="font-size:11.0pt">This is in
            spite of the fact that nvidia-smi reports all four A100s as
            active on each node. I have spent a good chunk of a week
            searching for a solution, trying variants of the GPU config
            lines and restarting the daemons, without any luck. <o:p></o:p></span></p>
        <p class="MsoNormal"><span style="font-size:11.0pt"> <o:p></o:p></span></p>
        <p class="MsoNormal"><span style="font-size:11.0pt">The relevant
            lines from my current config files are below. The head node
            and all workers have the same gres.conf and slurm.conf
            files. Can anyone suggest anything else I should be looking
            at or adding? I’m guessing that this is a problem that many
            have faced, and any guidance would be greatly appreciated.<o:p></o:p></span></p>
        <p class="MsoNormal"><span style="font-size:10.5pt"> </span><span
            style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="p1"><span class="s1">root@sjc01enadsapp00:/etc/slurm-llnl#
            grep gpu slurm.conf</span><o:p></o:p></p>
        <p class="p1"><span class="s1">GresTypes=</span><span class="s2"><b>gpu</b></span><o:p></o:p></p>
        <p class="p1"><span class="s1">NodeName=sjc01enadsapp0[1-8]
            RealMemory=2063731 Sockets=2 CoresPerSocket=16
            ThreadsPerCore=2 Gres=</span><span class="s2"><b>gpu</b></span><span
            class="s1">:4 State=UNKNOWN</span><o:p></o:p></p>
        <p class="p1"> <o:p></o:p></p>
        <p class="p1"><span class="s1">root@sjc01enadsapp00:/etc/slurm-llnl#
            cat gres.conf</span><o:p></o:p></p>
        <p class="p1"><span class="s1">NodeName=sjc01enadsapp0[1-8]
            Name=gpu File=/dev/nvidia[0-3]</span><o:p></o:p></p>
        <p class="p1"><span class="s1"> </span><o:p></o:p></p>
        <p class="p1"><span class="s1"> </span><o:p></o:p></p>
        <p class="p1"><span class="s1"> </span><o:p></o:p></p>
        <p class="p1"><span class="s1">root@sjc01enadsapp00:~# sinfo -N
            -o "%.20N %.15C %.10t %.10m %.15P %.15G %.75E"</span><o:p></o:p></p>
        <p class="p1"><span class="apple-converted-space">            </span><span
            class="s1">NODELIST
          </span><span class="apple-converted-space">  </span><span
            class="s1">CPUS(A/I/O/T)</span><span
            class="apple-converted-space">     
          </span><span class="s1">STATE </span><span
            class="apple-converted-space">    </span>
          <span class="s1">MEMORY </span><span
            class="apple-converted-space">      </span><span class="s1">PARTITION</span><span
            class="apple-converted-space">           
          </span><span class="s1">GRES</span><span
            class="apple-converted-space">                             
                                                   
          </span><span class="s1">REASON</span><o:p></o:p></p>
        <p class="p1"><span class="apple-converted-space">     </span><span
            class="s1">sjc01enadsapp01
          </span><span class="apple-converted-space">      </span><span
            class="s1">0/0/64/64</span><span
            class="apple-converted-space">     
          </span><span class="s1">drain</span><span
            class="apple-converted-space">    </span>
          <span class="s1">2063731</span><span
            class="apple-converted-space">        </span>
          <span class="s1">Primary* </span><span
            class="apple-converted-space">          </span>
          <span class="s1">gpu:4 </span><span
            class="apple-converted-space">                     
          </span><span class="s1">gres/gpu count reported lower than
            configured (0 < 4)</span><o:p></o:p></p>
        <p class="p1"><span class="apple-converted-space">     </span><span
            class="s1">sjc01enadsapp02
          </span><span class="apple-converted-space">      </span><span
            class="s1">0/0/64/64</span><span
            class="apple-converted-space">     
          </span><span class="s1">drain</span><span
            class="apple-converted-space">    </span>
          <span class="s1">2063731</span><span
            class="apple-converted-space">        </span>
          <span class="s1">Primary* </span><span
            class="apple-converted-space">          </span>
          <span class="s1">gpu:4 </span><span
            class="apple-converted-space">                     
          </span><span class="s1">gres/gpu count reported lower than
            configured (0 < 4)</span><o:p></o:p></p>
        <p class="p1"><span class="apple-converted-space">     </span><span
            class="s1">sjc01enadsapp03
          </span><span class="apple-converted-space">      </span><span
            class="s1">0/0/64/64</span><span
            class="apple-converted-space">     
          </span><span class="s1">drain</span><span
            class="apple-converted-space">    </span>
          <span class="s1">2063731</span><span
            class="apple-converted-space">        </span>
          <span class="s1">Primary* </span><span
            class="apple-converted-space">          </span>
          <span class="s1">gpu:4 </span><span
            class="apple-converted-space">                     
          </span><span class="s1">gres/gpu count reported lower than
            configured (0 < 4)</span><o:p></o:p></p>
        <p class="p1"><span class="apple-converted-space">     </span><span
            class="s1">sjc01enadsapp04
          </span><span class="apple-converted-space">      </span><span
            class="s1">0/0/64/64</span><span
            class="apple-converted-space">     
          </span><span class="s1">drain</span><span
            class="apple-converted-space">    </span>
          <span class="s1">2063731</span><span
            class="apple-converted-space">        </span>
          <span class="s1">Primary* </span><span
            class="apple-converted-space">          </span>
          <span class="s1">gpu:4 </span><span
            class="apple-converted-space">                     
          </span><span class="s1">gres/gpu count reported lower than
            configured (0 < 4)</span><o:p></o:p></p>
        <p class="p1"><span class="apple-converted-space">     </span><span
            class="s1">sjc01enadsapp05
          </span><span class="apple-converted-space">      </span><span
            class="s1">0/0/64/64</span><span
            class="apple-converted-space">     
          </span><span class="s1">drain</span><span
            class="apple-converted-space">    </span>
          <span class="s1">2063731</span><span
            class="apple-converted-space">        </span>
          <span class="s1">Primary* </span><span
            class="apple-converted-space">          </span>
          <span class="s1">gpu:4 </span><span
            class="apple-converted-space">                     
          </span><span class="s1">gres/gpu count reported lower than
            configured (0 < 4)</span><o:p></o:p></p>
        <p class="p1"><span class="apple-converted-space">     </span><span
            class="s1">sjc01enadsapp06
          </span><span class="apple-converted-space">      </span><span
            class="s1">0/0/64/64</span><span
            class="apple-converted-space">     
          </span><span class="s1">drain</span><span
            class="apple-converted-space">    </span>
          <span class="s1">2063731</span><span
            class="apple-converted-space">        </span>
          <span class="s1">Primary* </span><span
            class="apple-converted-space">          </span>
          <span class="s1">gpu:4 </span><span
            class="apple-converted-space">                     
          </span><span class="s1">gres/gpu count reported lower than
            configured (0 < 4)</span><o:p></o:p></p>
        <p class="p1"><span class="apple-converted-space">     </span><span
            class="s1">sjc01enadsapp07
          </span><span class="apple-converted-space">      </span><span
            class="s1">0/0/64/64</span><span
            class="apple-converted-space">     
          </span><span class="s1">drain</span><span
            class="apple-converted-space">    </span>
          <span class="s1">2063731</span><span
            class="apple-converted-space">        </span>
          <span class="s1">Primary* </span><span
            class="apple-converted-space">          </span>
          <span class="s1">gpu:4 </span><span
            class="apple-converted-space">                     
          </span><span class="s1">gres/gpu count reported lower than
            configured (0 < 4)</span><o:p></o:p></p>
        <p class="p1"><span class="apple-converted-space">     </span><span
            class="s1">sjc01enadsapp08
          </span><span class="apple-converted-space">      </span><span
            class="s1">0/0/64/64</span><span
            class="apple-converted-space">     
          </span><span class="s1">drain</span><span
            class="apple-converted-space">    </span>
          <span class="s1">2063731</span><span
            class="apple-converted-space">        </span>
          <span class="s1">Primary* </span><span
            class="apple-converted-space">          </span>
          <span class="s1">gpu:4 </span><span
            class="apple-converted-space">                     
          </span><span class="s1">gres/gpu count reported lower than
            configured (0 < 4)</span><o:p></o:p></p>
        <p class="p1"> <o:p></o:p></p>
        <p class="p1"> <o:p></o:p></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">root@sjc01enadsapp07:~#
            nvidia-smi</span><span style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">Tue
            May  3 18:41:34 2022       </span><span
            style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">+-----------------------------------------------------------------------------+</span><span
            style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">|
            NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA
            Version: 11.4     |</span><span style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">|-------------------------------+----------------------+----------------------+</span><span
            style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">| GPU 
            Name        Persistence-M| Bus-Id        Disp.A | Volatile
            Uncorr. ECC |</span><span style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">| Fan 
            Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util 
            Compute M. |</span><span style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">|    
                                      |                      |          
                MIG M. |</span><span style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">|===============================+======================+======================|</span><span
            style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">|   0 
            NVIDIA A100-PCI...  On   | 00000000:17:00.0 Off |           
                    0 |</span><span style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">| N/A
              42C    P0    49W / 250W |      4MiB / 40536MiB |      0% 
                Default |</span><span style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">|    
                                      |                      |          
              Disabled |</span><span style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">+-------------------------------+----------------------+----------------------+</span><span
            style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">|   1 
            NVIDIA A100-PCI...  On   | 00000000:65:00.0 Off |           
                    0 |</span><span style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">| N/A
              41C    P0    48W / 250W |      4MiB / 40536MiB |      0% 
                Default |</span><span style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">|    
                                      |                      |          
              Disabled |</span><span style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">+-------------------------------+----------------------+----------------------+</span><span
            style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">|   2 
            NVIDIA A100-PCI...  On   | 00000000:CA:00.0 Off |           
                    0 |</span><span style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">| N/A
              35C    P0    44W / 250W |      4MiB / 40536MiB |      0% 
                Default |</span><span style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">|    
                                      |                      |          
              Disabled |</span><span style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">+-------------------------------+----------------------+----------------------+</span><span
            style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">|   3 
            NVIDIA A100-PCI...  On   | 00000000:E3:00.0 Off |           
                    0 |</span><span style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">| N/A
              38C    P0    45W / 250W |      4MiB / 40536MiB |      0% 
                Default |</span><span style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">|    
                                      |                      |          
              Disabled |</span><span style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">+-------------------------------+----------------------+----------------------+</span><span
            style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">     
                                                                       
                         </span><span style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">+-----------------------------------------------------------------------------+</span><span
            style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">|
            Processes:                                                 
                            |</span><span style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">|  GPU
              GI   CI        PID   Type   Process name                 
            GPU Memory |</span><span style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">|     
              ID   ID                                                  
            Usage      |</span><span style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">|=============================================================================|</span><span
            style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">|    0
              N/A  N/A      2179      G   /usr/lib/xorg/Xorg           
                  4MiB |</span><span style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">|    1
              N/A  N/A      2179      G   /usr/lib/xorg/Xorg           
                  4MiB |</span><span style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">|    2
              N/A  N/A      2179      G   /usr/lib/xorg/Xorg           
                  4MiB |</span><span style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">|    3
              N/A  N/A      2179      G   /usr/lib/xorg/Xorg           
                  4MiB |</span><span style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
            style="font-size:8.5pt;font-family:Menlo;color:black">+-----------------------------------------------------------------------------+</span><span
            style="font-size:11.0pt"><o:p></o:p></span></p>
        <p class="MsoNormal"><span style="font-size:11.0pt"> <o:p></o:p></span></p>
      </div>
    </blockquote>
    <p><br>
    </p>
    <pre class="moz-signature" cols="72">--
ETH Zurich
Stephan Roth
Systems Administrator
IT Support Group (ISG)
D-ITET
ETF D 104
Sternwartstrasse 7
8092 Zurich

Phone +41 44 632 30 59
<a class="moz-txt-link-abbreviated" href="mailto:stephan.roth@ee.ethz.ch">stephan.roth@ee.ethz.ch</a>
<a class="moz-txt-link-abbreviated" href="http://www.isg.ee.ethz.ch">www.isg.ee.ethz.ch</a>

Working days: Mon,Tue,Thu,Fri</pre>
  </body>
</html>