[slurm-users] "fatal: can't stat gres.conf"

Sean Crosby richardnixonshead at gmail.com
Mon Jul 23 19:42:28 MDT 2018


Hi Alex,

What's the actual content of your gres.conf file? It seems to me that you
have a trailing comma after the location of the NVIDIA device.

Our gres.conf has

NodeName=gpuhost[001-077] Name=gpu Type=p100 File=/dev/nvidia0 Cores=0,2,4,6,8,10,12,14,16,18,20,22
NodeName=gpuhost[001-077] Name=gpu Type=p100 File=/dev/nvidia1 Cores=0,2,4,6,8,10,12,14,16,18,20,22
NodeName=gpuhost[001-077] Name=gpu Type=p100 File=/dev/nvidia2 Cores=1,3,5,7,9,11,13,15,17,19,21,23
NodeName=gpuhost[001-077] Name=gpu Type=p100 File=/dev/nvidia3 Cores=1,3,5,7,9,11,13,15,17,19,21,23

I think you have a comma between the File= value and the Cores=/CPUs= field.

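I can't see your gres.conf, so this is only a guess, but going by the error
message the line slurmd is choking on looks something like the sketch below
(the NodeName= and Name= parts are assumed; only the File string shows up in
your log):

NodeName=n0038 Name=gpu File=/dev/nvidia[0-1],CPUs="0-9"

With the comma there, slurmd treats everything after File= as a single device
path and tries to stat a file literally named /dev/nvidia[0-1],CPUs="0-9",
which doesn't exist. Replacing the comma with a space (and dropping the
quotes) should give it something it can actually stat:

NodeName=n0038 Name=gpu File=/dev/nvidia[0-1] CPUs=0-9
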
Sean


On Tue, 24 Jul 2018 at 08:13, Alex Chekholko <alex at calicolabs.com> wrote:

> Hi all,
>
> I have a few working GPU compute nodes.  I bought a couple more
> identical nodes.  They are all diskless, so they all boot from the same
> disk image.
>
> For some reason slurmd refuses to start on the new nodes, and I'm not able
> to find any differences in hardware or software.  Google searches for
> "error: Waiting for gres.conf file " or "fatal: can't stat gres.conf file"
> are not helping.
>
> The gres.conf file is there and identical on all nodes. The
> /dev/nvidia[0-3] files are there and 'nvidia-smi -L' works fine.  What am I
> missing?
>
>
> [root@n0038 ~]# slurmd -Dcvvv
> slurmd: debug2: hwloc_topology_init
> slurmd: debug2: hwloc_topology_load
> slurmd: debug:  CPUs:20 Boards:1 Sockets:2 CoresPerSocket:10 ThreadsPerCore:1
> slurmd: Node configuration differs from hardware: CPUs=16:20(hw) Boards=1:1(hw) SocketsPerBoard=16:2(hw) CoresPerSocket=1:10(hw) ThreadsPerCore=1:1(hw)
> slurmd: Message aggregation disabled
> slurmd: debug:  init: Gres GPU plugin loaded
> slurmd: error: Waiting for gres.conf file /dev/nvidia[0-1],CPUs="0-9"
> slurmd: fatal: can't stat gres.conf file /dev/nvidia[0-1],CPUs="0-9": No such file or directory
>
>
>
> SLURM version ohpc-17.02.7-61
>
>