[slurm-users] [EXTERNAL] Re: trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes

Andy Riebs andy at candooz.com
Fri Nov 27 16:43:53 UTC 2020


Steve, you've exhausted my best ideas... hope someone else can jump in!

Andy

On Fri, Nov 27, 2020, 11:19 AM Steve Bland <sbland at rossvideo.com> wrote:

>
> Andy
>
> I appreciate you making me check again; things do get missed.
>
> SELinux is off and firewalld is disabled:
>
> [root@SRVGRIDSLURM01 ~]# sestatus
> SELinux status:                 disabled
>
> [root@SRVGRIDSLURM01 ~]# systemctl status firewalld
> ● firewalld.service - firewalld - dynamic firewall daemon
>    Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
>    Active: inactive (dead)
>      Docs: man:firewalld(1)
>
> The one thing I can think of is that the system running slurmctld has two
> network interfaces. It serves as a gateway, so it has two network addresses.
> Two of the test slurmd nodes are on the other side of that gateway box; one
> is on the same box as slurmctld. The two on the other side of the gateway
> have a different IP address range and possibly a different mask.
>
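
One way to sanity-check the two-interface theory is to ask which of the controller's interfaces, if any, is directly connected to each NodeAddr. Below is a minimal Python sketch. The interface addresses and masks in the usage comment are made up for illustration (the real ones are not shown in this thread), and `interface_for` is a hypothetical helper, not part of Slurm.

```python
import ipaddress

def interface_for(node_addr, local_interfaces):
    """Return the first local interface (in 'address/prefix' form) whose
    network contains node_addr, or None if no interface is directly
    connected to it and traffic would have to be routed."""
    addr = ipaddress.ip_address(node_addr)
    for iface in local_interfaces:
        # ip_interface() accepts a host address with a prefix length and
        # exposes the enclosing network via .network
        if addr in ipaddress.ip_interface(iface).network:
            return iface
    return None

# Example (hypothetical controller interfaces; substitute the output of
# `ip -4 addr` on the slurmctld host):
#   interface_for("192.168.1.61", ["10.0.0.1/24", "192.168.1.60/24"])
```

A result of None for a node would suggest that replies depend on routing through the gateway rather than on a directly attached network, which is where a mask mismatch could matter.
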
> This is from slurm.conf for the three nodes. I know they are talking; I
> can see it in the logs when they are set to a debug logging level.
> The NodeName info comes from slurmd -C, so that is correct.
> I added the IP addresses, but that did not matter.
>
> # COMPUTE NODES
> NodeName=SRVGRIDSLURM01 NodeAddr=192.168.1.60 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7821
> NodeName=SRVGRIDSLURM02 NodeAddr=192.168.1.61 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7821
> NodeName=srvgridslurm03 NodeAddr=192.168.1.62 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7821
> PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>
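
Since slurmctld connects to slurmd and slurmd also connects back to slurmctld, a quick TCP probe in both directions can show whether one side is unreachable. Below is a rough Python sketch. It assumes the default ports (SlurmctldPort 6817, SlurmdPort 6818) are in use, since the posted slurm.conf excerpt does not show them; `can_connect` is a hypothetical helper, not a Slurm tool.

```python
import socket

def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within
    timeout seconds, False on refusal, timeout, or any other socket error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError, timeouts, and unreachable hosts
        return False

# Example, run on the slurmctld host (6818 is the assumed SlurmdPort):
#   for addr in ("192.168.1.60", "192.168.1.61", "192.168.1.62"):
#       print(addr, can_connect(addr, 6818))
# Then, on each compute node, probe the controller on the assumed
# SlurmctldPort 6817 to test the reverse direction.
```

If the probe succeeds one way but not the other, the asymmetric path through the gateway would be the first thing to look at.
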
> About the only thing left I can think of is to make one of the nodes on the
> other side of the gateway the control node.
>
>
> *Steve Bland*
> *Technical Product Manager*
>
> *Third Party Products*
> Ross Video | Production Technology Experts
> T: +1 (613) 228-0688 ext.4219
> www.rossvideo.com
> ------------------------------
> *From:* Andy Riebs <andy.riebs at gmail.com> on behalf of Andy Riebs <
> andy at candooz.com>
> *Sent:* 26 November 2020 13:40
> *To:* Steve Bland <sbland at rossvideo.com>; Slurm User Community List <
> slurm-users at lists.schedmd.com>
> *Subject:* Re: [EXTERNAL] Re: [slurm-users] trying to diagnose a
> connectivity issue between the slurmctld process and the slurmd nodes
>
>
> One last shot on the firewall front, Steve -- does the control node have a
> firewall enabled? I've seen cases where that can cause the sporadic
> messaging failures that you seem to be seeing.
>
> Failing that, I'll defer to anyone with different ideas!
>
> Andy
> On 11/26/2020 1:01 PM, Steve Bland wrote:
>