[slurm-users] Suddenly getting "Invalid node name specified" when attempting srun/sbatch
Benjamin Wong
bwong at keiserlab.org
Thu Jul 11 01:10:22 UTC 2019
My server was having issues yesterday, so I rebooted it last night, but Slurm
has not been working properly since the reboot. I rebooted other machines
at the same time and they work fine, but this one in particular cannot run
any srun/sbatch commands because of an "Invalid node name specified" error.
I don't see anything wrong with what I'm doing, and DNS is working
completely fine.
# on slurmd node
[bwong1@mk-gpu-2 ~]$ srun /bin/hostname
srun: error: Unable to allocate resources: Invalid node name specified
# from slurmctld
[root@mk-slurm slurm]# ping mk-gpu-2
PING mk-gpu-2.c.keiserlab.org (10.10.100.109) 56(84) bytes of data.
64 bytes from mk-gpu-2.c.keiserlab.org (10.10.100.109)
# in slurmctld.log (19015 is my UID)
slurmctld: error: slurm_auth_get_host: Lookup failed: Unknown host
slurmctld: error: REQUEST_RESOURCE_ALLOCATE lacks alloc_node from uid=19015
slurmctld: _slurm_rpc_allocate_resources: Invalid node name specified
slurmctld: error: slurm_auth_get_host: Lookup failed: Unknown host
slurmctld: error: REQUEST_RESOURCE_ALLOCATE lacks alloc_node from uid=19015
slurmctld: _slurm_rpc_allocate_resources: Invalid node name specified
# relevant portions of slurm.conf
NodeName=mk-gpu-2 NodeAddr=10.10.100.109 RealMemory=750000 Gres=gpu:8 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 State=UNKNOWN
PartitionName=all.q Nodes=ALL Default=YES MaxTime=INFINITE State=UP
Any ideas about what's causing this "Unknown host" error? I have the correct
hostname and IP address in slurm.conf, so I'm not sure what else could be
going on.
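
One check that may help narrow this down, offered as a sketch rather than a diagnosis: the ping above only shows that the forward lookup of mk-gpu-2 works from the controller. The slurm_auth_get_host message could instead point at a failing reverse lookup of the submitting host's address, though that reading is an assumption. The helper below (check_lookup is a made-up name, not a Slurm tool) uses getent to test both directions through the system resolver, and would be run on the slurmctld host with the name and address from the slurm.conf excerpt above.

```shell
#!/bin/sh
# check_lookup NAME IP  (hypothetical helper, not part of Slurm)
# Reports whether NAME resolves (forward) and whether IP resolves back
# to some name (reverse) through the system resolver (NSS).
check_lookup() {
    name=$1
    ip=$2
    if getent hosts "$name" >/dev/null 2>&1; then
        echo "forward $name: ok"
    else
        echo "forward $name: FAILED"
    fi
    if getent hosts "$ip" >/dev/null 2>&1; then
        echo "reverse $ip: ok"
    else
        echo "reverse $ip: FAILED"
    fi
}

# Name and address taken from the slurm.conf excerpt above.
check_lookup mk-gpu-2 10.10.100.109
```

If the reverse line reports FAILED on the controller while the forward line is ok, that would be consistent with the "Unknown host" message, but it would not by itself prove that this is the cause.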
Thanks,
Benjamin Wong