<div dir="ltr">I have found that the "reason" field doesn't get updated after you correct the issue. For me, it's only when I move the node back to the idle state that the reason field is reset. So, assuming /dev/nvidia[0-3] is correct (I've never seen otherwise with NVIDIA GPUs), try returning the nodes to the idle state. Also, keep an eye on the slurmctld and slurmd logs; they are usually quite helpful in highlighting what the issue is.<div><br></div><div>David</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, May 3, 2022 at 11:50 AM Jim Kavitsky &lt;<a href="mailto:JimKavitsky@lucidmotors.com">JimKavitsky@lucidmotors.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">



<div lang="EN-US" style="overflow-wrap: break-word;">
<div class="gmail-m_-9199097329920133105WordSection1">
<p class="MsoNormal"><span style="font-size:10.5pt">Hello Fellow Slurm Admins,<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:10.5pt"><u></u> <u></u></span></p>
<p class="gmail-m_-9199097329920133105p1">I have a new Slurm installation that was working and running basic test jobs until I added GPU support. My worker nodes are now all in the drain state, with the reason
<span class="gmail-m_-9199097329920133105s1">gres/gpu count reported lower than configured (0 &lt; 4)</span>.</p>
<p class="MsoNormal"><span style="font-size:10.5pt"><u></u> <u></u></span></p>
<p class="MsoNormal">This is in spite of the fact that nvidia-smi reports all four A100s as active on each node. I have spent a good chunk of a week googling for a solution and trying variants of the GPU config lines and restarting daemons, without any luck.</p>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal">The relevant lines from my current config files are below. The head node and all workers have the same gres.conf and slurm.conf files. Can anyone suggest anything else I should be looking at or adding? I’m guessing that this is a problem
 that many have faced, and any guidance would be greatly appreciated.<u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:10.5pt"><u></u> <u></u></span></p>
<p class="gmail-m_-9199097329920133105p1"><span class="gmail-m_-9199097329920133105s1">root@sjc01enadsapp00:/etc/slurm-llnl# grep gpu slurm.conf</span><u></u><u></u></p>
<p class="gmail-m_-9199097329920133105p1"><span class="gmail-m_-9199097329920133105s1">GresTypes=</span><span class="gmail-m_-9199097329920133105s2"><b>gpu</b></span><u></u><u></u></p>
<p class="gmail-m_-9199097329920133105p1"><span class="gmail-m_-9199097329920133105s1">NodeName=sjc01enadsapp0[1-8] RealMemory=2063731 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 Gres=</span><span class="gmail-m_-9199097329920133105s2"><b>gpu</b></span><span class="gmail-m_-9199097329920133105s1">:4 State=UNKNOWN<u></u><u></u></span></p>
<p class="gmail-m_-9199097329920133105p1"><u></u> <u></u></p>
<p class="gmail-m_-9199097329920133105p1"><span class="gmail-m_-9199097329920133105s1">root@sjc01enadsapp00:/etc/slurm-llnl# cat gres.conf</span><u></u><u></u></p>
<p class="gmail-m_-9199097329920133105p1"><span class="gmail-m_-9199097329920133105s1">NodeName=sjc01enadsapp0[1-8] Name=gpu File=/dev/nvidia[0-3]<u></u><u></u></span></p>
<p class="gmail-m_-9199097329920133105p1"><span class="gmail-m_-9199097329920133105s1"><u></u> <u></u></span></p>
<p class="gmail-m_-9199097329920133105p1"><span class="gmail-m_-9199097329920133105s1">root@sjc01enadsapp00:~# sinfo -N -o "%.20N %.15C %.10t %.10m %.15P %.15G %.75E"</span><u></u><u></u></p>
<p class="gmail-m_-9199097329920133105p1"><span class="gmail-m_-9199097329920133105s1">NODELIST CPUS(A/I/O/T) STATE MEMORY PARTITION GRES REASON</span></p>
<p class="gmail-m_-9199097329920133105p1"><span class="gmail-m_-9199097329920133105s1">sjc01enadsapp01 0/0/64/64 drain 2063731 Primary* gpu:4 gres/gpu count reported lower than configured (0 &lt; 4)</span></p>
<p class="gmail-m_-9199097329920133105p1"><span class="gmail-m_-9199097329920133105s1">sjc01enadsapp02 0/0/64/64 drain 2063731 Primary* gpu:4 gres/gpu count reported lower than configured (0 &lt; 4)</span></p>
<p class="gmail-m_-9199097329920133105p1"><span class="gmail-m_-9199097329920133105s1">sjc01enadsapp03 0/0/64/64 drain 2063731 Primary* gpu:4 gres/gpu count reported lower than configured (0 &lt; 4)</span></p>
<p class="gmail-m_-9199097329920133105p1"><span class="gmail-m_-9199097329920133105s1">sjc01enadsapp04 0/0/64/64 drain 2063731 Primary* gpu:4 gres/gpu count reported lower than configured (0 &lt; 4)</span></p>
<p class="gmail-m_-9199097329920133105p1"><span class="gmail-m_-9199097329920133105s1">sjc01enadsapp05 0/0/64/64 drain 2063731 Primary* gpu:4 gres/gpu count reported lower than configured (0 &lt; 4)</span></p>
<p class="gmail-m_-9199097329920133105p1"><span class="gmail-m_-9199097329920133105s1">sjc01enadsapp06 0/0/64/64 drain 2063731 Primary* gpu:4 gres/gpu count reported lower than configured (0 &lt; 4)</span></p>
<p class="gmail-m_-9199097329920133105p1"><span class="gmail-m_-9199097329920133105s1">sjc01enadsapp07 0/0/64/64 drain 2063731 Primary* gpu:4 gres/gpu count reported lower than configured (0 &lt; 4)</span></p>
<p class="gmail-m_-9199097329920133105p1"><span class="gmail-m_-9199097329920133105s1">sjc01enadsapp08 0/0/64/64 drain 2063731 Primary* gpu:4 gres/gpu count reported lower than configured (0 &lt; 4)</span></p>
<p class="gmail-m_-9199097329920133105p1"><u></u> <u></u></p>
<p class="gmail-m_-9199097329920133105p1"><u></u> <u></u></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">root@sjc01enadsapp07:~# nvidia-smi<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">Tue May  3 18:41:34 2022       <u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">+-----------------------------------------------------------------------------+<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">|-------------------------------+----------------------+----------------------+<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">|                               |                      |               MIG M. |<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">|===============================+======================+======================|<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">|   0  NVIDIA A100-PCI...  On   | 00000000:17:00.0 Off |                    0 |<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">| N/A   42C    P0    49W / 250W |      4MiB / 40536MiB |      0%      Default |<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">|                               |                      |             Disabled |<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">+-------------------------------+----------------------+----------------------+<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">|   1  NVIDIA A100-PCI...  On   | 00000000:65:00.0 Off |                    0 |<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">| N/A   41C    P0    48W / 250W |      4MiB / 40536MiB |      0%      Default |<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">|                               |                      |             Disabled |<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">+-------------------------------+----------------------+----------------------+<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">|   2  NVIDIA A100-PCI...  On   | 00000000:CA:00.0 Off |                    0 |<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">| N/A   35C    P0    44W / 250W |      4MiB / 40536MiB |      0%      Default |<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">|                               |                      |             Disabled |<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">+-------------------------------+----------------------+----------------------+<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">|   3  NVIDIA A100-PCI...  On   | 00000000:E3:00.0 Off |                    0 |<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">| N/A   38C    P0    45W / 250W |      4MiB / 40536MiB |      0%      Default |<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">|                               |                      |             Disabled |<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">+-------------------------------+----------------------+----------------------+<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">                                                                               <u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">+-----------------------------------------------------------------------------+<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">| Processes:                                                                  |<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">|        ID   ID                                                   Usage      |<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">|=============================================================================|<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">|    0   N/A  N/A      2179      G   /usr/lib/xorg/Xorg                  4MiB |<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">|    1   N/A  N/A      2179      G   /usr/lib/xorg/Xorg                  4MiB |<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">|    2   N/A  N/A      2179      G   /usr/lib/xorg/Xorg                  4MiB |<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">|    3   N/A  N/A      2179      G   /usr/lib/xorg/Xorg                  4MiB |<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:8.5pt;font-family:Menlo;color:black">+-----------------------------------------------------------------------------+<u></u><u></u></span></p>
<p class="MsoNormal"><u></u> <u></u></p>
</div>


</div>
</blockquote></div>
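<div>As a rough illustration of the suggestion at the top of this reply, the sketch below picks out nodes whose drain reason is the gres/gpu count message and prints a <code>scontrol update NodeName=... State=RESUME</code> command for each, without running anything. The helper name <code>resume_commands</code> and the <code>SAMPLE</code> constant are hypothetical; the two sample rows are copied from the sinfo output in the quoted message. It assumes the underlying gres problem has already been fixed and slurmd restarted on the nodes.</div>

```python
# Hypothetical helper: build (but do not run) scontrol commands that
# return drained nodes to service once the gres issue has been fixed.
GRES_REASON = "gres/gpu count reported lower than configured"

# Two rows copied from the sinfo output in the quoted message, in the order
# node, cpus, state, memory, partition, gres, reason.
SAMPLE = """\
sjc01enadsapp01 0/0/64/64 drain 2063731 Primary* gpu:4 gres/gpu count reported lower than configured (0 < 4)
sjc01enadsapp02 0/0/64/64 drain 2063731 Primary* gpu:4 gres/gpu count reported lower than configured (0 < 4)
"""


def resume_commands(sinfo_text):
    """Return one scontrol command per node drained for the gres/gpu reason."""
    commands = []
    for line in sinfo_text.splitlines():
        # Split into at most 7 fields so the free-text reason stays intact.
        fields = line.split(None, 6)
        if len(fields) != 7:
            continue  # header fragments or blank lines
        node, state, reason = fields[0], fields[2], fields[6]
        if state.startswith("drain") and reason.startswith(GRES_REASON):
            commands.append(f"scontrol update NodeName={node} State=RESUME")
    return commands


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for command in resume_commands(SAMPLE):
        print(command)
```

<div>Printing the commands rather than executing them leaves the final decision with the admin. If the same reason comes back after a resume, the slurmd log on that node is the next place to look.</div>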