[slurm-users] [EXTERNAL] Re: trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes
Steve Bland
sbland at rossvideo.com
Fri Nov 27 16:18:05 UTC 2020
Andy,
I appreciate you making me check again; things do get missed.
SELinux is off and firewalld is disabled:
[root at SRVGRIDSLURM01 ~]# sestatus
SELinux status: disabled
[root at SRVGRIDSLURM01 ~]# systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Docs: man:firewalld(1)
The one thing I can think of is that the system running slurmctld has two network interfaces. It serves as a gateway, so it has two network addresses. Two of the test slurmd nodes are on the other side of that gateway box; the third is on the same box as slurmctld. The two on the other side of the gateway have a different IP address range and possibly a different mask.
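One way to check the address-range question is to test each node address against the networks attached to the gateway's two interfaces. This is a minimal sketch using Python's standard ipaddress module. The two networks in GATEWAY_NETWORKS are hypothetical placeholders, since the actual interface addresses and masks are not given in this thread; network_for is an illustrative helper name.

```python
import ipaddress

# Hypothetical networks for the gateway's two interfaces. The thread does not
# state them, so substitute the real values (e.g. from `ip -4 addr` on the gateway).
GATEWAY_NETWORKS = [
    ipaddress.ip_network("192.168.1.0/24"),
    ipaddress.ip_network("10.10.0.0/16"),
]


def network_for(addr, networks=GATEWAY_NETWORKS):
    """Return the first network that contains addr, or None if none does."""
    ip = ipaddress.ip_address(addr)
    for net in networks:
        if ip in net:
            return net
    return None


if __name__ == "__main__":
    # NodeAddr values as listed in the slurm.conf excerpt in this message.
    for addr in ("192.168.1.60", "192.168.1.61", "192.168.1.62"):
        print(addr, "->", network_for(addr))
```

An address that maps to None is not directly reachable on either interface and would depend on routing through some other hop.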
This is from slurm.conf for the three nodes. I know they are talking; I can see it in the logs when logging is set to a debug level.
The NodeName info comes from slurmd -C, so that is correct.
I added the IP addresses (NodeAddr), but that did not matter.
# COMPUTE NODES
NodeName=SRVGRIDSLURM01 NodeAddr=192.168.1.60 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7821
NodeName=SRVGRIDSLURM02 NodeAddr=192.168.1.61 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7821
NodeName=srvgridslurm03 NodeAddr=192.168.1.62 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7821
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
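To test reachability from the slurmctld host, the node entries above can be parsed and each NodeAddr probed over TCP. This is a rough sketch, not a Slurm tool: it assumes slurmd listens on Slurm's default SlurmdPort of 6818, that the config lives at /etc/slurm/slurm.conf, and that keywords are spelled exactly as in the excerpt (slurm.conf keywords are case-insensitive, and hostlist expressions such as node[01-04] are not handled here). parse_nodes and can_connect are illustrative names.

```python
import socket

SLURMD_PORT = 6818  # Slurm's default SlurmdPort; adjust if slurm.conf overrides it


def parse_nodes(conf_text):
    """Map NodeName -> NodeAddr (falling back to the name) from slurm.conf text."""
    nodes = {}
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line.startswith("NodeName="):
            continue
        fields = dict(tok.split("=", 1) for tok in line.split() if "=" in tok)
        nodes[fields["NodeName"]] = fields.get("NodeAddr", fields["NodeName"])
    return nodes


def can_connect(host, port=SLURMD_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Path is an assumption; change it to wherever slurm.conf is installed.
    with open("/etc/slurm/slurm.conf") as fh:
        for name, addr in parse_nodes(fh.read()).items():
            status = "ok" if can_connect(addr) else "UNREACHABLE"
            print(f"{name} ({addr}): {status}")
```

Run from the control node, this only exercises the slurmctld-to-slurmd direction. The reverse direction can be checked by calling can_connect from each compute node against the controller's address and its SlurmctldPort (6817 by default).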
About the only thing I can think of is to make one of the nodes on the other side of the gateway the control node.
Steve Bland
Technical Product Manager
Third Party Products
Ross Video | Production Technology Experts
T: +1 (613) 228-0688 ext.4219
www.rossvideo.com
________________________________
From: Andy Riebs <andy.riebs at gmail.com> on behalf of Andy Riebs <andy at candooz.com>
Sent: 26 November 2020 13:40
To: Steve Bland <sbland at rossvideo.com>; Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [EXTERNAL] Re: [slurm-users] trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes
One last shot on the firewall front Steve -- does the control node have a firewall enabled? I've seen cases where that can cause the sporadic messaging failures that you seem to be seeing.
That failing, I'll defer to anyone with different ideas!
Andy
On 11/26/2020 1:01 PM, Steve Bland wrote: