[slurm-users] gres/gpu: count changed for node node002 from 0 to 1

Robert Kudyba rkudyba at fordham.edu
Fri Mar 13 15:36:08 UTC 2020


We're running slurm-17.11.12 on Bright Cluster 8.1 and our node002 keeps
going into a draining state:
 sinfo -a
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*        up   infinite      1   drng node002

sinfo -N -o "%.20N %.15C %.10t %.10m %.15P %.15G %.35E"
            NODELIST   CPUS(A/I/O/T)      STATE     MEMORY       PARTITION            GRES                              REASON
             node001       9/15/0/24        mix     191800           defq*           gpu:1                                none
             node002       1/0/23/24       drng     191800           defq*           gpu:1  gres/gpu count changed and jobs are
             node003       1/23/0/24        mix     191800           defq*           gpu:1                                none
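
The REASON column is truncated there by the %.35E format; as far as I know the full drain reason can be pulled with the usual commands (just sinfo/scontrol as I understand them):

    sinfo -R                                      # list drained/down nodes with their full reason
    scontrol show node node002 | grep -i reason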

None of the nodes has a separate slurm.conf file; it's all shared from the
head node. What else could be causing this?
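
My understanding is that the GPU count slurmctld expects comes from the Gres= field on the node's line in slurm.conf, while the count the node reports at registration comes from its gres.conf, so the "count changed from 0 to 1" message would point at a mismatch between the two. Something along these lines is what I'd expect the pair to look like (hypothetical excerpt, not our actual files; the device path and node range are just illustrative):

    # slurm.conf (shared from the head node) -- hypothetical excerpt
    GresTypes=gpu
    NodeName=node[001-003] CPUs=24 RealMemory=191800 Gres=gpu:1 State=UNKNOWN

    # gres.conf on the compute nodes -- hypothetical excerpt
    NodeName=node[001-003] Name=gpu File=/dev/nvidia0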

[2020-03-13T07:14:28.590] gres/gpu: count changed for node node002 from 0 to 1
[2020-03-13T07:14:28.590] error: _slurm_rpc_node_registration node=node002: Invalid argument
[2020-03-13T07:14:28.590] error: Node node001 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-03-13T07:14:28.590] error: Node node003 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-03-13T07:47:48.787] error: Node node001 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-03-13T07:47:48.787] error: Node node003 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-03-13T07:47:48.788] gres/gpu: count changed for node node002 from 0 to 1
[2020-03-13T07:47:48.788] error: _slurm_rpc_node_registration node=node002: Invalid argument
[2020-03-13T08:21:08.057] error: Node node001 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-03-13T08:21:08.058] error: Node node003 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-03-13T08:21:08.058] gres/gpu: count changed for node node002 from 0 to 1
[2020-03-13T08:21:08.058] error: _slurm_rpc_node_registration node=node002: Invalid argument
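
Is the right next step something along these lines, once the configs are confirmed identical everywhere? (These are standard scontrol/slurmd invocations as I understand them; the exact slurm.conf/gres.conf locations under Bright are a guess on my part.)

    # Compare what slurmctld has recorded for the node with what the node itself sees
    scontrol show node node002 | grep -i gres
    md5sum /etc/slurm/slurm.conf          # on the head node and on node002; path may differ under Bright

    # On node002: print the hardware configuration slurmd detects
    slurmd -C

    # Once the configs agree, push the change and undrain the node
    scontrol reconfigure
    scontrol update NodeName=node002 State=RESUME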