[slurm-users] Jobs fail on specific nodes.
Roger Mason
rmason at mun.ca
Tue May 24 14:50:49 UTC 2022
Hello,
I have a small cluster of 4 nodes. Jobs are failing on two of the nodes,
with the following written to slurm-*.out:
less 1x1x1_220524_121358/slurm-1368_1.out
srun: error: Unable to resolve "node012": Unknown server error
srun: error: fwd_tree_thread: can't find address for host node012, check slurm.conf
srun: error: Task launch for 1368.0 failed on node node012: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
The same job runs correctly on either of two other nodes.
sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
macpro*      up   infinite      1   idle node012
macpro*      up   infinite      3   down node[001-002,004]
I can ssh into node012 and the above sinfo suggests no communication
problems. I have not modified slurm.conf recently.
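Since the srun errors point at hostname resolution rather than at the job itself, one possible check is whether each node name resolves on the machine where srun actually runs. The following is only a sketch under assumptions: the function name check_resolve is made up for illustration, and it assumes getent and awk are available on the nodes.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): report whether each given
# hostname resolves on the machine where this script is run.
# Prints "ok <host> <address>" or "FAIL <host>" for every argument and
# returns non-zero if at least one name did not resolve.
check_resolve() {
    rc=0
    for host in "$@"; do
        # getent goes through the system's NSS configuration
        # (/etc/hosts, DNS, ...), so it sees what ordinary programs see.
        addr=$(getent hosts "$host" | awk 'NR==1 {print $1}')
        if [ -n "$addr" ]; then
            echo "ok $host $addr"
        else
            echo "FAIL $host"
            rc=1
        fi
    done
    return $rc
}
```

Running `check_resolve node001 node002 node004 node012` on each compute node (not only on the machine used for ssh) would show whether node012 is unknown to some of them, which would be consistent with the "Unable to resolve" message above.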
I would appreciate any suggestions on what might be causing this problem
or what I can do to diagnose it.
Thanks,
Roger