[slurm-users] srun problem -- Can't find an address, check slurm.conf

Paul Edmon pedmon at cfa.harvard.edu
Wed Nov 7 07:57:36 MST 2018


This smacks of either the submission host, the destination host, or the 
master not being able to resolve the name to an IP.  I would triple 
check that to ensure that resolution is working.
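
For example (just a sketch, substituting whichever node is failing), running something like the following on the submission host, on the node itself, and on the machine running slurmctld should show whether they all agree on an address for n38:

getent hosts n38
scontrol show node n38 | grep -Ei 'nodeaddr|nodehostname'

If they disagree, setting NodeAddr (and NodeHostName if needed) explicitly for that node in slurm.conf is one way to take DNS/hosts lookups out of the picture.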

-Paul Edmon-

On 11/7/18 8:33 AM, Scott Hazelhurst wrote:
>
> Dear list
>
> We have a relatively new installation of SLURM. We have started to have a problem with some of the nodes when using srun:
>
> [scott at cream-ce ~]$ srun --pty -w n38 hostname
> srun: error: fwd_tree_thread: can't find address for host n38, check slurm.conf
> srun: error: Task launch for 18710.0 failed on node n38: Can't find an address, check slurm.conf
> srun: error: Application launch failed: Can't find an address, check slurm.conf
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: Timed out waiting for job step to complete
>
>
> I’ve spent most of the day following up on others who’ve had similar problems and checking everything, but I haven’t made any progress:
>
> — Using sbatch there is no problem: jobs launch on the given node and finish normally and reliably
>
> — srun works fine for most of the nodes
>
> — the slurm.conf file is identical on all nodes (checked by diffing, and no complaints in the logs)
>
> — both the slurmctld and the slurmd start cleanly with no obvious errors or warnings (e.g. about slurm.conf)
>
> — sinfo reports that all our nodes are up, some busy some not. The problem is independent of load on the nodes
>
> — I’ve increased the log level on the control daemon and there’s no obvious additional information when the srun happens
>
> — we use puppet to maintain our infrastructure, so while there must be a difference between the machines that work and those that don’t, I can’t see it
>
> — all nodes run ntpd and the times appear the same when checked manually
>
> — all nodes have plenty of disk space
>
> — I’ve tried restarting both slurmd and slurmctld, and this has no effect, not even briefly
>
> — hostname on working and problematic nodes gives the expected results, in the same format as the others (see the sketch of these checks after this list)
>
> — all hostnames are in /etc/hosts on all machines
>
> — we currently have just under 40 worker nodes, and TreeWidth=50 in slurm.conf
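>
> For concreteness, the manual checks above were roughly of this form (paths and node names are illustrative; run on the head node and on a sample of workers):
>
> md5sum /etc/slurm/slurm.conf
> getent hosts n38
> date; ssh n38 date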
>
>
> We’re running SLURM 17.11.10 under CentOS 7.5
>
>
> This is the final part of the slurm.conf file
>
> NodeName=n[02,08,10,29-40,42-45] RealMemory=131072 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN
> NodeName=n[05] RealMemory=256000 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN
> NodeName=n[07] RealMemory=45000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
> NodeName=n[15] RealMemory=48000 Sockets=20 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
> NodeName=n[16] RealMemory=31000 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN
> NodeName=n[17] RealMemory=215000 Sockets=16 CoresPerSocket=8 ThreadsPerCore=1 State=UNKNOWN
> NodeName=n[18] RealMemory=90000 Sockets=20 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
> NodeName=n[19] RealMemory=515000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
> NodeName=n[20] RealMemory=64000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
> NodeName=n[21] RealMemory=64000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
> NodeName=n[22] RealMemory=56000 Sockets=16 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
> NodeName=n[23] RealMemory=225500 Sockets=20 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
> NodeName=n[27] RealMemory=65536 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
> NodeName=n[28] RealMemory=65536 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
>
> PartitionName=batch Nodes=n[02,05,07-08,10,15-23,27-40,42-45] Default=YES MaxTime=40320 State=UP
>
> As examples (the split is reproducible with the short srun loop sketched below):
> — srun fails: n15, n17, n27, n28, n38, n45
> — srun succeeds: n02, n10, n16, n18, and n29 onwards, except for n38 and n45
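>
> The loop is roughly:
>
> for n in n02 n10 n15 n16 n17 n38; do echo "== $n"; srun -w $n hostname; done
>
> with the failing nodes producing the "Can't find an address" error shown above.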
>
>
> Many thanks for any help
>
> Scott
>
>


