[slurm-users] slurm-users Digest, Vol 37, Issue 46
vero chaul
verochaul at gmail.com
Thu Nov 26 18:41:12 UTC 2020
Unsubscribe
On Thu, Nov 26, 2020 at 15:40, <slurm-users-request at lists.schedmd.com> wrote:
> Send slurm-users mailing list submissions to
> slurm-users at lists.schedmd.com
>
> To subscribe or unsubscribe via the World Wide Web, visit
> https://lists.schedmd.com/cgi-bin/mailman/listinfo/slurm-users
> or, via email, send a message with subject or body 'help' to
> slurm-users-request at lists.schedmd.com
>
> You can reach the person managing the list at
> slurm-users-owner at lists.schedmd.com
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of slurm-users digest..."
>
>
> Today's Topics:
>
> 1. Re: [EXTERNAL] Re: trying to diagnose a connectivity issue
> between the slurmctld process and the slurmd nodes (Steve Bland)
> 2. Re: [EXTERNAL] Re: trying to diagnose a connectivity issue
> between the slurmctld process and the slurmd nodes (Andy Riebs)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 26 Nov 2020 18:01:25 +0000
> From: Steve Bland <sbland at rossvideo.com>
> To: "andy at candooz.com" <andy at candooz.com>, Slurm User Community List
> <slurm-users at lists.schedmd.com>
> Subject: Re: [slurm-users] [EXTERNAL] Re: trying to diagnose a
> connectivity issue between the slurmctld process and the slurmd
> nodes
> Message-ID:
> <
> YTXPR0101MB2302A3F22023838FB5745EA2CFF90 at YTXPR0101MB2302.CANPRD01.PROD.OUTLOOK.COM
> >
>
> Content-Type: text/plain; charset="us-ascii"
>
> Thanks Andy
>
> The firewall is off on all three systems. Also, if they could not communicate,
> I do not think 'scontrol show node' would return the data that it does,
> and the logs would not show responses, as indicated below.
>
> And the names are correct; I used the recommended 'hostname -s' when
> configuring the slurm.conf node entries.
> In fact, Slurm seems to be case sensitive, which surprised the heck out of
> me.
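>
> (For reference, a sketch of what the node entries presumably look like,
> reconstructed from the 'scontrol show node' output below; the
> PartitionName line is an assumption based on Partitions=debug:
>
> NodeName=SRVGRIDSLURM[01-02] CPUs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7821 State=UNKNOWN
> NodeName=srvgridslurm03 CPUs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7821 State=UNKNOWN
> PartitionName=debug Nodes=ALL Default=YES State=UP
>
> Each NodeName must match that host's 'hostname -s' output exactly,
> including case.)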
>
>
>
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of
> Andy Riebs
> Sent: Thursday, November 26, 2020 12:50
> To: slurm-users at lists.schedmd.com
> Subject: [EXTERNAL] Re: [slurm-users] trying to diagnose a connectivity
> issue between the slurmctld process and the slurmd nodes
>
>
> 1. Look for a firewall on all of your Slurm nodes -- firewalls almost
> always break Slurm communications.
> 2. Confirm that "ssh srvgridslurm01 hostname" returns, exactly,
> "srvgridslurm01"
>
> Andy
> On 11/26/2020 12:21 PM, Steve Bland wrote:
>
> sinfo always reports the nodes as not responding:
> [root at srvgridslurm03 ~]# sinfo -R
> REASON               USER      TIMESTAMP           NODELIST
> Not responding       slurm     2020-11-26T09:12:58 SRVGRIDSLURM01
> Not responding       slurm     2020-11-26T08:27:58 SRVGRIDSLURM02
> Not responding       slurm     2020-11-26T10:00:14 srvgridslurm03
>
>
> By tailing the slurmctld log, I can see when a node is recognized:
> Node srvgridslurm03 now responding
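>
> (The log was tailed with something like the following; the path is an
> assumption, check SlurmctldLogFile in slurm.conf for the actual location:
>
> tail -f /var/log/slurmctld.log
> )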
>
>
> By turning up the logging levels, I can see the communication between
> slurmctld and the nodes, and there appears to be a response:
>
> [2020-11-26T12:05:14.333] debug3: Tree sending to SRVGRIDSLURM01
> [2020-11-26T12:05:14.333] debug2: Tree head got back 0 looking for 3
> [2020-11-26T12:05:14.333] debug3: Tree sending to SRVGRIDSLURM02
> [2020-11-26T12:05:14.333] debug3: Tree sending to srvgridslurm03
> [2020-11-26T12:05:14.335] debug2: Tree head got back 1
> [2020-11-26T12:05:14.335] debug2: Tree head got back 2
> [2020-11-26T12:05:14.336] debug2: Tree head got back 3
> [2020-11-26T12:05:14.338] debug2: node_did_resp SRVGRIDSLURM01
> [2020-11-26T12:05:14.338] debug2: node_did_resp SRVGRIDSLURM02
> [2020-11-26T12:05:14.338] debug2: node_did_resp srvgridslurm03
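>
> (For reference, the controller's debug level can be raised at runtime with
> something like
>
> scontrol setdebug debug3
>
> or set persistently with SlurmctldDebug=debug3 in slurm.conf; SlurmdDebug
> does the same for the node daemons.)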
>
> What I do not understand is the disconnect: slurmctld seems to record the
> responses, yet it flags every node as not responding. There are only three
> nodes right now, as this is a test environment: three CentOS 7 systems.
>
> [root at SRVGRIDSLURM01 ~]# scontrol show node
> NodeName=SRVGRIDSLURM01 Arch=x86_64 CoresPerSocket=4
> CPUAlloc=0 CPUTot=4 CPULoad=0.01
> AvailableFeatures=(null)
> ActiveFeatures=(null)
> Gres=(null)
> NodeAddr=SRVGRIDSLURM01 NodeHostName=SRVGRIDSLURM01 Version=20.11.0
> OS=Linux 3.10.0-1160.2.2.el7.x86_64 #1 SMP Tue Oct 20 16:53:08 UTC 2020
> RealMemory=7821 AllocMem=0 FreeMem=5211 Sockets=1 Boards=1
> State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> Partitions=debug
> BootTime=2020-11-24T08:04:25 SlurmdStartTime=2020-11-26T11:38:25
> CfgTRES=cpu=4,mem=7821M,billing=4
> AllocTRES=
> CapWatts=n/a
> CurrentWatts=0 AveWatts=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> Reason=Not responding [slurm at 2020-11-26T09:12:58]
> Comment=(null)
>
> NodeName=SRVGRIDSLURM02 Arch=x86_64 CoresPerSocket=4
> CPUAlloc=0 CPUTot=4 CPULoad=0.01
> AvailableFeatures=(null)
> ActiveFeatures=(null)
> Gres=(null)
> NodeAddr=SRVGRIDSLURM02 NodeHostName=SRVGRIDSLURM02 Version=20.11.0
> OS=Linux 3.10.0-1160.2.2.el7.x86_64 #1 SMP Tue Oct 20 16:53:08 UTC 2020
> RealMemory=7821 AllocMem=0 FreeMem=6900 Sockets=1 Boards=1
> State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> Partitions=debug
> BootTime=2020-11-24T08:04:32 SlurmdStartTime=2020-11-26T10:31:08
> CfgTRES=cpu=4,mem=7821M,billing=4
> AllocTRES=
> CapWatts=n/a
> CurrentWatts=0 AveWatts=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> Reason=Not responding [slurm at 2020-11-26T08:27:58]
> Comment=(null)
>
> NodeName=srvgridslurm03 Arch=x86_64 CoresPerSocket=4
> CPUAlloc=0 CPUTot=4 CPULoad=0.01
> AvailableFeatures=(null)
> ActiveFeatures=(null)
> Gres=(null)
> NodeAddr=srvgridslurm03 NodeHostName=srvgridslurm03 Version=20.11.0
> OS=Linux 3.10.0-1160.2.2.el7.x86_64 #1 SMP Tue Oct 20 16:53:08 UTC 2020
> RealMemory=7821 AllocMem=0 FreeMem=7170 Sockets=1 Boards=1
> State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> Partitions=debug
> BootTime=2020-11-26T09:46:49 SlurmdStartTime=2020-11-26T11:55:23
> CfgTRES=cpu=4,mem=7821M,billing=4
> AllocTRES=
> CapWatts=n/a
> CurrentWatts=0 AveWatts=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> Reason=Not responding [slurm at 2020-11-26T10:00:14]
> Comment=(null)
>
> Any suggestions? Thanks
>
>
> ----------------------------------------------
>
> This e-mail and any attachments may contain information that is
> confidential to Ross Video.
>
> If you are not the intended recipient, please notify me immediately by
> replying to this message. Please also delete all copies. Thank you.
> ----------------------------------------------
>
> ------------------------------
>
> Message: 2
> Date: Thu, 26 Nov 2020 13:40:24 -0500
> From: Andy Riebs <andy at candooz.com>
> To: Steve Bland <sbland at rossvideo.com>, Slurm User Community List
> <slurm-users at lists.schedmd.com>
> Subject: Re: [slurm-users] [EXTERNAL] Re: trying to diagnose a
> connectivity issue between the slurmctld process and the slurmd
> nodes
> Message-ID: <cdd891a8-bcff-8cc7-6b40-5854a8095986 at candooz.com>
> Content-Type: text/plain; charset="windows-1252"; Format="flowed"
>
> One last shot on the firewall front Steve -- does the control node have
> a firewall enabled? I've seen cases where that can cause the sporadic
> messaging failures that you seem to be seeing.
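>
> If a firewall does turn out to be running there, the default ports to open
> are 6817/tcp (slurmctld) and 6818/tcp (slurmd); check SlurmctldPort and
> SlurmdPort in slurm.conf in case they have been changed. With firewalld on
> CentOS 7, for example:
>
> firewall-cmd --permanent --add-port=6817-6818/tcp
> firewall-cmd --reload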
>
> That failing, I'll defer to anyone with different ideas!
>
> Andy
>
> On 11/26/2020 1:01 PM, Steve Bland wrote:
> > [quoted text trimmed -- identical to Message 1 above]
>
> End of slurm-users Digest, Vol 37, Issue 46
> *******************************************
>
--
Veronica Chaul
+5411 3581-4041