[slurm-users] [EXTERNAL] Re: trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes

Steve Bland sbland at rossvideo.com
Fri Nov 27 16:18:05 UTC 2020


Andy

I appreciate you making me check again; things do get missed.

SELinux is off and firewalld is disabled:


[root@SRVGRIDSLURM01 ~]# sestatus
SELinux status:                 disabled

[root@SRVGRIDSLURM01 ~]# systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
   Active: inactive (dead)
     Docs: man:firewalld(1)

The one thing I can think of is that the system running slurmctld has two network interfaces. It serves as a gateway, so it has two network addresses. Two of the test slurmd nodes are on the other side of that gateway box; the third is on the same box as slurmctld. The two on the other side of the gateway have a different IP address range and possibly a different mask.
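One way to test the address-range theory is to check which controller interface, if any, has each node's address on-link. A minimal Python sketch, assuming made-up interface settings: the 192.168.1.60 address comes from the slurm.conf quoted below, but the /24 mask and the second interface (10.0.0.1/24) are hypothetical placeholders, not values confirmed in this thread.

```python
import ipaddress

def on_link(interface_cidr, addr):
    """Return True if addr lies inside the network configured on the interface."""
    iface = ipaddress.ip_interface(interface_cidr)
    return ipaddress.ip_address(addr) in iface.network

# NodeAddr values as listed in slurm.conf
nodes = {
    "SRVGRIDSLURM01": "192.168.1.60",
    "SRVGRIDSLURM02": "192.168.1.61",
    "srvgridslurm03": "192.168.1.62",
}

# ASSUMPTION: the masks and the second interface are hypothetical;
# substitute the real output of `ip -4 addr` from the gateway box.
controller_ifaces = ["192.168.1.60/24", "10.0.0.1/24"]

for name, addr in nodes.items():
    links = [i for i in controller_ifaces if on_link(i, addr)]
    print(name, addr, "on-link via", links if links else "no interface (must be routed)")
```

If a node turns out not to be on-link for any interface, traffic to it depends on the routing table, which is where a wrong mask would show up.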

This is from slurm.conf for the three nodes. I know they are talking; I can see it in the logs when logging is set to a debug level.
The NodeName info comes from slurmd -C, so that is correct.
I added the IP addresses (NodeAddr), but that made no difference.


# COMPUTE NODES
NodeName=SRVGRIDSLURM01 NodeAddr=192.168.1.60 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7821
NodeName=SRVGRIDSLURM02 NodeAddr=192.168.1.61 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7821
NodeName=srvgridslurm03 NodeAddr=192.168.1.62 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7821

PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
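With the firewall ruled out, a plain TCP probe of each NodeAddr can show whether the controller can actually open a connection to every slurmd. A rough Python sketch, assuming the Slurm default SlurmdPort of 6818 has not been overridden in slurm.conf; the helper names (port_open, probe_nodes) are made up for illustration.

```python
import socket

# ASSUMPTION: default SlurmdPort; change this if slurm.conf sets another value.
SLURMD_PORT = 6818

# NodeAddr values from the slurm.conf excerpt above
NODE_ADDRS = {
    "SRVGRIDSLURM01": "192.168.1.60",
    "SRVGRIDSLURM02": "192.168.1.61",
    "srvgridslurm03": "192.168.1.62",
}

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False

def probe_nodes(addrs, port=SLURMD_PORT, timeout=2.0):
    """Map each node name to whether its address accepts connections on port."""
    return {name: port_open(addr, port, timeout) for name, addr in addrs.items()}
```

Running print(probe_nodes(NODE_ADDRS)) on the slurmctld host tests the controller-to-node direction. The reverse direction can be checked from each node with port_open against the controller's address and the default SlurmctldPort of 6817, again assuming that default is in use.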

About the only thing left that I can think of is to make one of the nodes on the other side of the gateway the control node.



Steve Bland
Technical Product Manager

Third Party Products
Ross Video | Production Technology Experts
T: +1 (613) 228-0688 ext.4219
www.rossvideo.com

________________________________
From: Andy Riebs <andy.riebs at gmail.com> on behalf of Andy Riebs <andy at candooz.com>
Sent: 26 November 2020 13:40
To: Steve Bland <sbland at rossvideo.com>; Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [EXTERNAL] Re: [slurm-users] trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes


One last shot on the firewall front, Steve -- does the control node have a firewall enabled? I've seen cases where that can cause the sporadic messaging failures that you seem to be seeing.

That failing, I'll defer to anyone with different ideas!

Andy

On 11/26/2020 1:01 PM, Steve Bland wrote: