[slurm-users] gres/gpu: count changed for node node002 from 0 to 1
Robert Kudyba
rkudyba at fordham.edu
Fri Mar 13 15:36:08 UTC 2020
We're running slurm-17.11.12 on Bright Cluster 8.1 and our node002 keeps
going into a draining state:
sinfo -a
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
defq* up infinite 1 drng node002
sinfo -N -o "%.20N %.15C %.10t %.10m %.15P %.15G %.35E"
            NODELIST   CPUS(A/I/O/T)      STATE     MEMORY       PARTITION            GRES                              REASON
             node001       9/15/0/24        mix     191800           defq*           gpu:1                                none
             node002       1/0/23/24       drng     191800           defq*           gpu:1 gres/gpu count changed and jobs are
             node003       1/23/0/24        mix     191800           defq*           gpu:1                                none
None of the nodes has a separate slurm.conf file; it's all shared from the
head node. What else could be causing this?
[2020-03-13T07:14:28.590] gres/gpu: count changed for node node002 from 0 to 1
[2020-03-13T07:14:28.590] error: _slurm_rpc_node_registration node=node002: Invalid argument
[2020-03-13T07:14:28.590] error: Node node001 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-03-13T07:14:28.590] error: Node node003 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-03-13T07:47:48.787] error: Node node001 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-03-13T07:47:48.787] error: Node node003 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-03-13T07:47:48.788] gres/gpu: count changed for node node002 from 0 to 1
[2020-03-13T07:47:48.788] error: _slurm_rpc_node_registration node=node002: Invalid argument
[2020-03-13T08:21:08.057] error: Node node001 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-03-13T08:21:08.058] error: Node node003 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-03-13T08:21:08.058] gres/gpu: count changed for node node002 from 0 to 1
[2020-03-13T08:21:08.058] error: _slurm_rpc_node_registration node=node002: Invalid argument
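My reading of "count changed for node node002 from 0 to 1" is that slurmctld has zero GPUs configured for node002 while slurmd on node002 registers one, so the GRES definitions would need to agree between the shared slurm.conf and gres.conf. Roughly what I'd expect the relevant entries to look like (CPU, memory, and device-path values below are illustrative guesses, not our actual config):

```
# slurm.conf (shared from the head node) -- values are illustrative only
GresTypes=gpu
NodeName=node00[1-3] CPUs=24 RealMemory=191800 Gres=gpu:1 State=UNKNOWN

# gres.conf -- device path is a guess; must match what slurmd detects
NodeName=node00[1-3] Name=gpu File=/dev/nvidia0
```

If node002's gres.conf (or the Gres= entry in slurm.conf as slurmctld last read it) differs from the other nodes, that could explain why only node002 drains.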