[slurm-users] [EXTERNAL] Re: trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes
Andy Riebs
andy at candooz.com
Thu Nov 26 18:40:24 UTC 2020
One last shot on the firewall front, Steve -- does the control node have
a firewall enabled? I've seen cases where that can cause the sporadic
messaging failures that you seem to be seeing.
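For what it's worth, a quick way to check that on a CentOS 7 box, where
firewalld is the stock firewall (a sketch; commands assume root):

    firewall-cmd --state         # prints "running" if firewalld is active
    systemctl stop firewalld     # stop it for a quick test
    systemctl disable firewalld  # keep it off across reboots if that helps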
Failing that, I'll defer to anyone with different ideas!
Andy
On 11/26/2020 1:01 PM, Steve Bland wrote:
>
> Thanks Andy
>
> Firewall is off on all three systems. Also, if they could not
> communicate, I do not think ‘scontrol show node’ would return the
> data that it does, and the logs would not show responses as indicated
> below.
>
> And the names are correct; I used the recommended ‘hostname -s’ when
> configuring the slurm.conf node entries.
>
> In fact, Slurm seems to be case-sensitive about these names, which
> surprised the heck out of me.
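> A quick way to double-check that on each node, as a sketch (assuming
> the config lives at /etc/slurm/slurm.conf):
>
>     hostname -s                               # the short name slurmd reports
>     grep -i 'NodeName' /etc/slurm/slurm.conf  # must match it exactly, case included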
>
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf
> Of Andy Riebs
> Sent: Thursday, November 26, 2020 12:50
> To: slurm-users at lists.schedmd.com
> Subject: [EXTERNAL] Re: [slurm-users] trying to diagnose a
> connectivity issue between the slurmctld process and the slurmd nodes
>
> 1. Look for a firewall on all of your Slurm nodes -- firewalls almost
> always break Slurm communications.
> 2. Confirm that "ssh srvgridslurm01 hostname" returns, exactly,
> "srvgridslurm01" (a loop covering both checks is sketched below).
>
> Andy
>
> On 11/26/2020 12:21 PM, Steve Bland wrote:
>
> sinfo always reports the nodes as not responding:
>
> [root at srvgridslurm03 ~]# sinfo -R
>
> REASON USER TIMESTAMP NODELIST
>
> Not responding slurm 2020-11-26T09:12:58 SRVGRIDSLURM01
>
> Not responding slurm 2020-11-26T08:27:58 SRVGRIDSLURM02
>
> Not responding slurm 2020-11-26T10:00:14 srvgridslurm03
>
> By tailing the log for slurmctld, I can see when a node is recognized
>
> Node srvgridslurm03 now responding
>
> By turning up the logging levels, I can see communication between
> slurmctld and the nodes, and there appears to be a response:
>
> [2020-11-26T12:05:14.333] debug3: Tree sending to SRVGRIDSLURM01
>
> [2020-11-26T12:05:14.333] debug2: Tree head got back 0 looking for 3
>
> [2020-11-26T12:05:14.333] debug3: Tree sending to SRVGRIDSLURM02
>
> [2020-11-26T12:05:14.333] debug3: Tree sending to srvgridslurm03
>
> [2020-11-26T12:05:14.335] debug2: Tree head got back 1
>
> [2020-11-26T12:05:14.335] debug2: Tree head got back 2
>
> [2020-11-26T12:05:14.336] debug2: Tree head got back 3
>
> [2020-11-26T12:05:14.338] debug2: node_did_resp SRVGRIDSLURM01
>
> [2020-11-26T12:05:14.338] debug2: node_did_resp SRVGRIDSLURM02
>
> [2020-11-26T12:05:14.338] debug2: node_did_resp srvgridslurm03
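> For reference, the slurmctld debug level can also be raised on the
> fly, without editing slurm.conf or restarting the daemon (the log
> path below is an assumption; check SlurmctldLogFile in your config):
>
>     scontrol setdebug debug3          # runtime equivalent of SlurmctldDebug=debug3
>     tail -f /var/log/slurmctld.log    # watch the tree traffic live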
>
> What I do not understand is the disconnect: it seems to record the
> responses, but flags the nodes as not responding -- all of them. There
> are only three right now, as this is a test environment: three
> CentOS 7 systems.
>
> [root at SRVGRIDSLURM01 ~]# scontrol show node
>
> NodeName=SRVGRIDSLURM01 Arch=x86_64 CoresPerSocket=4
>
> CPUAlloc=0 CPUTot=4 CPULoad=0.01
>
> AvailableFeatures=(null)
>
> ActiveFeatures=(null)
>
> Gres=(null)
>
> NodeAddr=SRVGRIDSLURM01 NodeHostName=SRVGRIDSLURM01 Version=20.11.0
>
> OS=Linux 3.10.0-1160.2.2.el7.x86_64 #1 SMP Tue Oct 20 16:53:08
> UTC 2020
>
> RealMemory=7821 AllocMem=0 FreeMem=5211 Sockets=1 Boards=1
>
> State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
> MCS_label=N/A
>
> Partitions=debug
>
> BootTime=2020-11-24T08:04:25 SlurmdStartTime=2020-11-26T11:38:25
>
> CfgTRES=cpu=4,mem=7821M,billing=4
>
> AllocTRES=
>
> CapWatts=n/a
>
> CurrentWatts=0 AveWatts=0
>
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> Reason=Not responding [slurm at 2020-11-26T09:12:58]
>
> Comment=(null)
>
> NodeName=SRVGRIDSLURM02 Arch=x86_64 CoresPerSocket=4
>
> CPUAlloc=0 CPUTot=4 CPULoad=0.01
>
> AvailableFeatures=(null)
>
> ActiveFeatures=(null)
>
> Gres=(null)
>
> NodeAddr=SRVGRIDSLURM02 NodeHostName=SRVGRIDSLURM02 Version=20.11.0
>
> OS=Linux 3.10.0-1160.2.2.el7.x86_64 #1 SMP Tue Oct 20 16:53:08
> UTC 2020
>
> RealMemory=7821 AllocMem=0 FreeMem=6900 Sockets=1 Boards=1
>
> State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
> MCS_label=N/A
>
> Partitions=debug
>
> BootTime=2020-11-24T08:04:32 SlurmdStartTime=2020-11-26T10:31:08
>
> CfgTRES=cpu=4,mem=7821M,billing=4
>
> AllocTRES=
>
> CapWatts=n/a
>
> CurrentWatts=0 AveWatts=0
>
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> Reason=Not responding [slurm at 2020-11-26T08:27:58]
>
> Comment=(null)
>
> NodeName=srvgridslurm03 Arch=x86_64 CoresPerSocket=4
>
> CPUAlloc=0 CPUTot=4 CPULoad=0.01
>
> AvailableFeatures=(null)
>
> ActiveFeatures=(null)
>
> Gres=(null)
>
> NodeAddr=srvgridslurm03 NodeHostName=srvgridslurm03 Version=20.11.0
>
> OS=Linux 3.10.0-1160.2.2.el7.x86_64 #1 SMP Tue Oct 20 16:53:08
> UTC 2020
>
> RealMemory=7821 AllocMem=0 FreeMem=7170 Sockets=1 Boards=1
>
> State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
> MCS_label=N/A
>
> Partitions=debug
>
> BootTime=2020-11-26T09:46:49 SlurmdStartTime=2020-11-26T11:55:23
>
> CfgTRES=cpu=4,mem=7821M,billing=4
>
> AllocTRES=
>
> CapWatts=n/a
>
> CurrentWatts=0 AveWatts=0
>
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> Reason=Not responding [slurm at 2020-11-26T10:00:14]
>
> Comment=(null)
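> One aside on the State=DOWN / Reason=Not responding lines above (not
> from this thread itself): once slurmctld marks a node DOWN, it may
> not bring it back on its own unless ReturnToService in slurm.conf
> allows it, so after fixing the underlying problem the state can be
> cleared by hand, e.g.:
>
>     scontrol update NodeName=SRVGRIDSLURM01,SRVGRIDSLURM02,srvgridslurm03 State=RESUME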
>
> Any suggestions? Thanks
>