[slurm-users] "fatal: can't stat gres.conf"
Alex Chekholko alex at calicolabs.com
Mon Jul 23 16:07:28 MDT 2018

Hi all,

I have a few working GPU compute nodes.  I bought a couple more
identical nodes.  They are all diskless, so they all boot from the same
disk image.

For some reason slurmd refuses to start on the new nodes, and I can't
find any differences in hardware or software.  Google searches for
"error: Waiting for gres.conf file" or "fatal: can't stat gres.conf file"
are not helping.

The gres.conf file is there and identical on all nodes.  The
/dev/nvidia[0-3] files are there and 'nvidia-smi -L' works fine.  What am I
missing?

[root@n0038 ~]# slurmd -Dcvvv
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug:  CPUs:20 Boards:1 Sockets:2 CoresPerSocket:10 ThreadsPerCore:1
slurmd: Node configuration differs from hardware: CPUs=16:20(hw) Boards=1:1(hw) SocketsPerBoard=16:2(hw) CoresPerSocket=1:10(hw) ThreadsPerCore=1:1(hw)
slurmd: Message aggregation disabled
slurmd: debug:  init: Gres GPU plugin loaded
slurmd: error: Waiting for gres.conf file /dev/nvidia[0-1],CPUs="0-9"
slurmd: fatal: can't stat gres.conf file /dev/nvidia[0-1],CPUs="0-9": No such file or directory

SLURM version ohpc-17.02.7-61
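
One detail in the log above: the path slurmd reports it cannot stat is the
whole string /dev/nvidia[0-1],CPUs="0-9", not just /dev/nvidia[0-1].  The
following is a minimal Python sketch for checking which device paths a given
string would resolve to and whether they exist on a node.  The helper names
(expand_bracket_range, missing_paths) are hypothetical, and the bracket
expansion is a simplified stand-in, not Slurm's own parser, so this only
illustrates the difference between the two strings and does not establish
the cause of the failure.

```python
import os
import re


def expand_bracket_range(pattern):
    """Expand the first numeric [a-b] range in a path-like string.

    Simplified stand-in for a hostlist-style expansion (assumption:
    a single range with no comma-separated lists inside the brackets).
    """
    m = re.search(r"\[(\d+)-(\d+)\]", pattern)
    if m is None:
        return [pattern]
    lo, hi = int(m.group(1)), int(m.group(2))
    return [pattern[:m.start()] + str(i) + pattern[m.end():]
            for i in range(lo, hi + 1)]


def missing_paths(value):
    """Return the expanded paths that do not exist on this machine."""
    return [p for p in expand_bracket_range(value) if not os.path.exists(p)]


if __name__ == "__main__":
    # First value is copied from the fatal message; second is the bare device range.
    for value in ('/dev/nvidia[0-1],CPUs="0-9"', "/dev/nvidia[0-1]"):
        print(value, "->", missing_paths(value) or "all present")
```

Run on one of the new nodes, the first string can never resolve to an
existing device file because the trailing ,CPUs="0-9" becomes part of each
path, while the second reports whether /dev/nvidia0 and /dev/nvidia1 exist.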