[slurm-users] Jobs fail on specific nodes.

Roger Mason rmason at mun.ca
Tue May 24 14:50:49 UTC 2022


Hello,

I have a small cluster of 4 nodes.  Jobs fail on two of the nodes, with
the following written to slurm-*.out:

less 1x1x1_220524_121358/slurm-1368_1.out 
srun: error: Unable to resolve "node012": Unknown server error
srun: error: fwd_tree_thread: can't find address for host node012, check slurm.conf
srun: error: Task launch for 1368.0 failed on node node012: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

The same job runs correctly on either of two other nodes.

sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST 
macpro*      up   infinite      1   idle node012 
macpro*      up   infinite      3   down node[001-002,004]

I can ssh into node012, and the sinfo output above lists it as idle,
which suggests no communication problems with that node.  I have not
modified slurm.conf recently.

I would appreciate any suggestions on what might be causing this problem
or what I can do to diagnose it.
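
Since the srun errors are about name resolution rather than the job
itself, one check that might narrow things down is whether each node
name resolves on the host where srun actually runs.  The following is
only a minimal sketch: the helper name resolve() is made up for
illustration, the node names are copied from the sinfo output above, and
it assumes Python 3 is available on the nodes.  It uses the system
resolver, so it reflects /etc/hosts and DNS but not any address
overrides in slurm.conf.

```python
import socket


def resolve(hostname):
    """Return the sorted unique addresses for hostname, or [] if lookup fails."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # The system resolver (hosts file, DNS, ...) could not find the name.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Node names as listed by sinfo; adjust to match slurm.conf.
    for node in ["node001", "node002", "node004", "node012"]:
        addrs = resolve(node)
        print(node, "->", ", ".join(addrs) if addrs else "UNRESOLVED")
```

Running this on the submit host and on each compute node, then comparing
the output, may show whether the nodes where jobs fail are the ones that
cannot resolve node012.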

Thanks,
Roger


