[slurm-users] Jobs fail on specific nodes.
Roger Mason rmason at mun.ca
Tue May 24 14:50:49 UTC 2022

Hello,
I have a small cluster of four nodes.  Jobs fail on two of the nodes,
with the following written to slurm-*.out:

less 1x1x1_220524_121358/slurm-1368_1.out
srun: error: Unable to resolve "node012": Unknown server error
srun: error: fwd_tree_thread: can't find address for host node012, check slurm.conf
srun: error: Task launch for 1368.0 failed on node node012: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

The same job runs correctly on either of the other two nodes.

sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
macpro*      up   infinite      1   idle node012
macpro*      up   infinite      3   down node[001-002,004]

I can ssh into node012, and the sinfo output above suggests no
communication problems.  I have not modified slurm.conf recently.

I would appreciate any suggestions on what might be causing this problem
or how I can diagnose it.
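
One possible first check, sketched here rather than taken from the cluster's actual setup: the srun messages say the name "node012" could not be resolved, so it may help to test name resolution on the machine where srun runs. The helper name check_resolves is hypothetical, and the sketch assumes getent is available on the nodes.

```shell
#!/bin/sh
# Hypothetical helper: report whether each given hostname resolves
# on the machine where this is run (via the system resolver, so
# /etc/hosts and DNS are both consulted according to nsswitch.conf).
check_resolves() {
    for host in "$@"; do
        if getent hosts "$host" >/dev/null 2>&1; then
            echo "$host: resolves"
        else
            echo "$host: does NOT resolve"
        fi
    done
}
```

For example, running check_resolves node001 node002 node004 node012 on each node in turn would show which machines cannot look up node012. Comparing that with the NodeAddr and NodeHostName fields printed by scontrol show node node012 may indicate whether the gap is in /etc/hosts, DNS, or slurm.conf.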
Thanks,
Roger
    
    