<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">Hi Jim,<br>
<br>
I don't know if it makes a difference, but I only ever use the
complete numeric suffix within brackets, as in<br>
<pre>sjc01enadsapp[01-08]</pre>
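<p>A minimal sketch of that rewrite, assuming the node definitions are the ones quoted below. The function name <i>pad_hostlist</i> is made up for illustration, and whether the fully bracketed form changes the GRES detection is not verified here. It only previews the rewritten line and does not modify any file.</p>

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Slurm): rewrite the hostlist
# "sjc01enadsapp0[1-8]" into the fully bracketed "sjc01enadsapp[01-08]".
# Reads config lines on stdin and writes the rewritten lines to stdout.
pad_hostlist() {
    sed -E 's/sjc01enadsapp0\[1-8\]/sjc01enadsapp[01-08]/g'
}

# Preview the rewritten gres.conf line from the quoted message.
echo 'NodeName=sjc01enadsapp0[1-8] Name=gpu File=/dev/nvidia[0-3]' | pad_hostlist
```

<p>The same substitution would apply to the NodeName line in <i>slurm.conf</i>, so that both files name the nodes identically.</p>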
Otherwise I'd raise the debug level of slurmd to maximum by
setting<br>
<pre>SlurmdDebug=debug5</pre>
in <i>slurm.conf</i>, tail <i>SlurmdLogFile</i> on a GPU node
and then restart <i>slurmd</i> there.<br>
This might shed some light on what goes wrong.<br>
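<p>The steps above could look roughly like this on one GPU node. This is a sketch under assumptions: the config path /etc/slurm-llnl/slurm.conf is taken from the prompt in the quoted message, slurmd is assumed to run as a systemd unit named slurmd, and <i>conf_value</i> is an illustrative helper, not a Slurm tool.</p>

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a Slurm tool): print the value of KEY from a
# slurm.conf-style file of KEY=VALUE lines. Key names are compared
# case-insensitively; if the key appears more than once, the last value wins.
# Usage: conf_value KEY FILE
conf_value() {
    awk -F= -v key="$1" '
        tolower($1) == tolower(key) { val = substr($0, index($0, "=") + 1) }
        END { if (val != "") print val }
    ' "$2"
}

# Intended use on a GPU node, after adding SlurmdDebug=debug5 to slurm.conf
# (path and systemd unit name are assumptions; adjust to your installation):
#   log=$(conf_value SlurmdLogFile /etc/slurm-llnl/slurm.conf)
#   tail -f "$log"              # in one terminal
#   systemctl restart slurmd    # in a second terminal
```

<p>Looking up the log path first avoids guessing where slurmd writes on a given distribution. If nothing is printed, SlurmdLogFile is not set in that file.</p>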
<br>
Cheers,<br>
Stephan</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">On 03.05.22 20:51, Jim Kavitsky wrote:<br>
</div>
<blockquote type="cite"
cite="mid:PH0PR18MB49224AB867E719611201C539B4C09@PH0PR18MB4922.namprd18.prod.outlook.com">
<style>@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}@font-face
{font-family:Menlo;
panose-1:2 11 6 9 3 8 4 2 2 4;}p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
font-size:10.0pt;
font-family:"Calibri",sans-serif;}p.p1, li.p1, div.p1
{mso-style-name:p1;
margin:0in;
font-size:10.5pt;
font-family:Menlo;
color:black;}span.s1
{mso-style-name:s1;}span.s2
{mso-style-name:s2;
color:#B42419;}span.apple-converted-space
{mso-style-name:apple-converted-space;}.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}div.WordSection1
{page:WordSection1;}</style>
<style type="text/css">.style1 {font-family: "Times New Roman";}</style>
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:11.0pt">Whoops. Sent
the first to an incorrect address. Apologies if this shows
up as a duplicate.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">-jimk<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<div style="border:none;border-top:solid #B5C4DF
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal" style="margin-bottom:12.0pt"><b><span
style="font-size:12.0pt;color:black">From:
</span></b><span style="font-size:12.0pt;color:black">Jim
Kavitsky <a class="moz-txt-link-rfc2396E" href="mailto:JimKavitsky@lucidmotors.com">&lt;JimKavitsky@lucidmotors.com&gt;</a><br>
<b>Date: </b>Tuesday, May 3, 2022 at 11:46 AM<br>
<b>To: </b><a class="moz-txt-link-abbreviated" href="mailto:slurm-users@schedmd.com">slurm-users@schedmd.com</a>
<a class="moz-txt-link-rfc2396E" href="mailto:slurm-users@schedmd.com">&lt;slurm-users@schedmd.com&gt;</a><br>
<b>Subject: </b>gres/gpu count lower than reported<o:p></o:p></span></p>
</div>
<p class="MsoNormal"><span style="font-size:10.5pt">Hello Fellow
Slurm Admins,</span><span style="font-size:11.0pt"><o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.5pt"> </span><span
style="font-size:11.0pt"><o:p></o:p></span></p>
<p class="p1">I have a new Slurm installation that was working
and running basic test jobs until I added GPU support. My
worker nodes are now all in drain state, with the reason
<span class="s1">gres/gpu count reported lower than configured
(0 &lt; 4)</span>.<o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:10.5pt"> </span><span
style="font-size:11.0pt"><o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">This is in
spite of the fact that nvidia-smi reports all four A100s as
active on each node. I have spent a good chunk of a week
googling around for a solution to this, trying variants of
the GPU config lines and restarting daemons, without
any luck. <o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> <o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">The relevant
lines from my current config files are below. The head node
and all workers have the same gres.conf and slurm.conf
files. Can anyone suggest anything else I should be looking
at or adding? I’m guessing that this is a problem that many
have faced, and any guidance would be greatly appreciated.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.5pt"> </span><span
style="font-size:11.0pt"><o:p></o:p></span></p>
<p class="p1"><span class="s1">root@sjc01enadsapp00:/etc/slurm-llnl#
grep gpu slurm.conf</span><o:p></o:p></p>
<p class="p1"><span class="s1">GresTypes=</span><span class="s2"><b>gpu</b></span><o:p></o:p></p>
<p class="p1"><span class="s1">NodeName=sjc01enadsapp0[1-8]
RealMemory=2063731 Sockets=2 CoresPerSocket=16
ThreadsPerCore=2 Gres=</span><span class="s2"><b>gpu</b></span><span
class="s1">:4 State=UNKNOWN</span><o:p></o:p></p>
<p class="p1"> <o:p></o:p></p>
<p class="p1"><span class="s1">root@sjc01enadsapp00:/etc/slurm-llnl#
cat gres.conf</span><o:p></o:p></p>
<p class="p1"><span class="s1">NodeName=sjc01enadsapp0[1-8]
Name=gpu File=/dev/nvidia[0-3]</span><o:p></o:p></p>
<p class="p1"><span class="s1"> </span><o:p></o:p></p>
<p class="p1"><span class="s1"> </span><o:p></o:p></p>
<p class="p1"><span class="s1"> </span><o:p></o:p></p>
<p class="p1"><span class="s1">root@sjc01enadsapp00:~# sinfo -N
-o "%.20N %.15C %.10t %.10m %.15P %.15G %.75E"</span><o:p></o:p></p>
<pre>            NODELIST   CPUS(A/I/O/T)      STATE     MEMORY       PARTITION            GRES                                                                      REASON
     sjc01enadsapp01       0/0/64/64      drain    2063731        Primary*           gpu:4                       gres/gpu count reported lower than configured (0 &lt; 4)
     sjc01enadsapp02       0/0/64/64      drain    2063731        Primary*           gpu:4                       gres/gpu count reported lower than configured (0 &lt; 4)
     sjc01enadsapp03       0/0/64/64      drain    2063731        Primary*           gpu:4                       gres/gpu count reported lower than configured (0 &lt; 4)
     sjc01enadsapp04       0/0/64/64      drain    2063731        Primary*           gpu:4                       gres/gpu count reported lower than configured (0 &lt; 4)
     sjc01enadsapp05       0/0/64/64      drain    2063731        Primary*           gpu:4                       gres/gpu count reported lower than configured (0 &lt; 4)
     sjc01enadsapp06       0/0/64/64      drain    2063731        Primary*           gpu:4                       gres/gpu count reported lower than configured (0 &lt; 4)
     sjc01enadsapp07       0/0/64/64      drain    2063731        Primary*           gpu:4                       gres/gpu count reported lower than configured (0 &lt; 4)
     sjc01enadsapp08       0/0/64/64      drain    2063731        Primary*           gpu:4                       gres/gpu count reported lower than configured (0 &lt; 4)</pre>
<p class="p1"> <o:p></o:p></p>
<p class="p1"> <o:p></o:p></p>
<pre>root@sjc01enadsapp07:~# nvidia-smi
Tue May  3 18:41:34 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  On   | 00000000:17:00.0 Off |                    0 |
| N/A   42C    P0    49W / 250W |      4MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  On   | 00000000:65:00.0 Off |                    0 |
| N/A   41C    P0    48W / 250W |      4MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-PCI...  On   | 00000000:CA:00.0 Off |                    0 |
| N/A   35C    P0    44W / 250W |      4MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-PCI...  On   | 00000000:E3:00.0 Off |                    0 |
| N/A   38C    P0    45W / 250W |      4MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2179      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      2179      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      2179      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      2179      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+</pre>
<p class="MsoNormal"><span style="font-size:11.0pt"> <o:p></o:p></span></p>
</div>
<br>
<br>
</blockquote>
<p><br>
</p>
<pre class="moz-signature" cols="72">--
ETH Zurich
Stephan Roth
Systems Administrator
IT Support Group (ISG)
D-ITET
ETF D 104
Sternwartstrasse 7
8092 Zurich
Phone +41 44 632 30 59
<a class="moz-txt-link-abbreviated" href="mailto:stephan.roth@ee.ethz.ch">stephan.roth@ee.ethz.ch</a>
<a class="moz-txt-link-abbreviated" href="http://www.isg.ee.ethz.ch">www.isg.ee.ethz.ch</a>
Working days: Mon,Tue,Thu,Fri</pre>
</body>
</html>