<div dir="ltr">Thanks for the suggestion; if my memory serves me right, I had to do that previously to get the drivers to load correctly after boot.<div><br></div><div>However, in this case both 'nvidia-smi' and 'nvidia-smi -L' run just fine and produce the expected output.</div><div><br></div><div>One difference I do see is that my older nodes have these two "uvm" devices:</div><div><br></div><div>working:</div><div><div>[root@n0035 ~]# ls -alhtr /dev/nvidia*</div><div>crw-rw-rw- 1 root root 195, 255 Nov 6 2017 /dev/nvidiactl</div><div>crw-rw-rw- 1 root root 195, 0 Nov 6 2017 /dev/nvidia0</div><div>crw-rw-rw- 1 root root 195, 1 Nov 6 2017 /dev/nvidia1</div><div>crw-rw-rw- 1 root root 195, 2 Nov 6 2017 /dev/nvidia2</div><div>crw-rw-rw- 1 root root 195, 3 Nov 6 2017 /dev/nvidia3</div><div>crw-rw-rw- 1 root root 241, 1 Nov 7 2017 /dev/nvidia-uvm-tools</div><div>crw-rw-rw- 1 root root 241, 0 Nov 7 2017 /dev/nvidia-uvm</div></div><div><br></div><div>not working:</div><div><div>[root@n0039 ~]# ls -alhtr /dev/nvidia*</div><div>crw-rw-rw- 1 root root 195, 255 Jul 12 17:09 /dev/nvidiactl</div><div>crw-rw-rw- 1 root root 195, 0 Jul 12 17:09 /dev/nvidia0</div><div>crw-rw-rw- 1 root root 195, 1 Jul 12 17:09 /dev/nvidia1</div><div>crw-rw-rw- 1 root root 195, 2 Jul 12 17:09 /dev/nvidia2</div><div>crw-rw-rw- 1 root root 195, 3 Jul 12 17:09 /dev/nvidia3</div></div><div><br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr">On Mon, Jul 23, 2018 at 3:41 PM Bill <<a href="mailto:bill@simplehpc.com">bill@simplehpc.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="min-height:22px;margin-bottom:8px">Hi Alex,</div><div style="min-height:22px;margin-bottom:8px"><br></div><div style="min-height:22px;margin-bottom:8px">Try running nvidia-smi before starting slurmd; I also ran into this issue. 
I have to run nvidia-smi before starting slurmd whenever I reboot the system.</div><div style="min-height:22px;margin-bottom:8px">Regards,</div><div style="min-height:22px;margin-bottom:8px">Bill</div><div id="m_-2665360825808841766original-content"><br><br><div><div style="font-size:70%;padding:2px 0">------------------ Original ------------------</div><div style="font-size:70%;background:#f0f0f0;color:#212121;padding:8px;border-radius:4px"><div><b>From:</b> Alex Chekholko <<a href="mailto:alex@calicolabs.com" target="_blank">alex@calicolabs.com</a>></div><div><b>Date:</b> Tue, Jul 24, 2018 6:10 AM</div><div><b>To:</b> Slurm User Community List <<a href="mailto:slurm-users@lists.schedmd.com" target="_blank">slurm-users@lists.schedmd.com</a>></div><div><b>Subject:</b> Re: [slurm-users] "fatal: can't stat gres.conf"</div></div></div><br><div dir="ltr">Hi all,<div><br></div><div>I have a few working GPU compute nodes. I bought a couple more identical nodes. They are all diskless, so they all boot from the same disk image.</div><div><br></div><div>For some reason slurmd refuses to start on the new nodes, and I'm not able to find any differences in hardware or software. Google searches for "error: Waiting for gres.conf file " or "fatal: can't stat gres.conf file" are not helping.</div><div><br></div><div>The gres.conf file is there and identical on all nodes. The /dev/nvidia[0-3] files are there and 'nvidia-smi -L' works fine. 
What am I missing?</div><div><br></div><div><br></div><div><div>[root@n0038 ~]# slurmd -Dcvvv</div><div>slurmd: debug2: hwloc_topology_init</div><div>slurmd: debug2: hwloc_topology_load</div><div>slurmd: debug: CPUs:20 Boards:1 Sockets:2 CoresPerSocket:10 ThreadsPerCore:1</div><div>slurmd: Node configuration differs from hardware: CPUs=16:20(hw) Boards=1:1(hw) SocketsPerBoard=16:2(hw) CoresPerSocket=1:10(hw) ThreadsPerCore=1:1(hw)</div><div>slurmd: Message aggregation disabled</div><div>slurmd: debug: init: Gres GPU plugin loaded</div><div>slurmd: error: Waiting for gres.conf file /dev/nvidia[0-1],CPUs="0-9"</div><div>slurmd: fatal: can't stat gres.conf file /dev/nvidia[0-1],CPUs="0-9": No such file or directory</div></div><div><br></div><div><br></div><div><br></div><div>SLURM version ohpc-17.02.7-61</div><div><br></div></div>
</div></blockquote></div>
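A note on the workaround discussed in this thread: running nvidia-smi (or any CUDA client) loads the nvidia-uvm kernel module and creates the /dev/nvidia-uvm* device nodes as a side effect, which is consistent with those nodes being absent on the machines where slurmd refuses to start. One way to automate Bill's "run nvidia-smi before slurmd" step is a systemd drop-in for the slurmd unit. This is only a sketch, assuming slurmd is managed as a systemd service named slurmd.service and that nvidia-smi lives at /usr/bin/nvidia-smi; the drop-in file path and name are hypothetical:

```ini
# /etc/systemd/system/slurmd.service.d/nvidia.conf  (hypothetical drop-in)
# Run nvidia-smi before slurmd starts, so the NVIDIA kernel modules load
# and the /dev/nvidia* and /dev/nvidia-uvm* device nodes get created
# before slurmd checks the devices listed in gres.conf.
[Service]
ExecStartPre=/usr/bin/nvidia-smi
```

After creating the drop-in, `systemctl daemon-reload` followed by restarting slurmd should pick it up. Alternatively, if the driver package ships nvidia-modprobe, something like `nvidia-modprobe -u -c0` is intended to create the uvm device nodes directly without invoking nvidia-smi; check the options your driver version actually supports.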