[slurm-users] slurm-users Digest, Vol 37, Issue 46

vero chaul verochaul at gmail.com
Thu Nov 26 18:41:12 UTC 2020


Unsubscribe

On Thu, Nov 26, 2020 at 15:40, <
slurm-users-request at lists.schedmd.com> wrote:

> Send slurm-users mailing list submissions to
>         slurm-users at lists.schedmd.com
>
> To subscribe or unsubscribe via the World Wide Web, visit
>         https://lists.schedmd.com/cgi-bin/mailman/listinfo/slurm-users
> or, via email, send a message with subject or body 'help' to
>         slurm-users-request at lists.schedmd.com
>
> You can reach the person managing the list at
>         slurm-users-owner at lists.schedmd.com
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of slurm-users digest..."
>
>
> Today's Topics:
>
>    1. Re: [EXTERNAL] Re: trying to diagnose a connectivity issue
>       between the slurmctld process and the slurmd nodes (Steve Bland)
>    2. Re: [EXTERNAL] Re: trying to diagnose a connectivity issue
>       between the slurmctld process and the slurmd nodes (Andy Riebs)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 26 Nov 2020 18:01:25 +0000
> From: Steve Bland <sbland at rossvideo.com>
> To: "andy at candooz.com" <andy at candooz.com>, Slurm User Community List
>         <slurm-users at lists.schedmd.com>
> Subject: Re: [slurm-users] [EXTERNAL] Re: trying to diagnose a
>         connectivity issue between the slurmctld process and the slurmd
> nodes
> Message-ID:
>         <
> YTXPR0101MB2302A3F22023838FB5745EA2CFF90 at YTXPR0101MB2302.CANPRD01.PROD.OUTLOOK.COM
> >
>
> Content-Type: text/plain; charset="us-ascii"
>
> Thanks Andy
>
> The firewall is off on all three systems. Also, if they could not
> communicate, I do not think 'scontrol show node' would return the data
> that it does, and the logs would not show responses as indicated below.
>
> And the names are correct; I used the recommended 'hostname -s' when
> configuring the slurm.conf node entries.
> In fact, Slurm seems to be case sensitive, which surprised the heck out
> of me.
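>
> As a quick sanity check, roughly what I compared on each node (a minimal
> sketch assuming the usual config path /etc/slurm/slurm.conf; yours may
> differ):
>
>     # short hostname, as slurmd reports it
>     hostname -s
>     # NodeName entries from slurm.conf, case included, for comparison
>     grep -i '^NodeName=' /etc/slurm/slurm.conf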
>
>
>
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of
> Andy Riebs
> Sent: Thursday, November 26, 2020 12:50
> To: slurm-users at lists.schedmd.com
> Subject: [EXTERNAL] Re: [slurm-users] trying to diagnose a connectivity
> issue between the slurmctld process and the slurmd nodes
>
>
>   1.  Look for a firewall on all of your Slurm nodes -- they almost always
> break Slurm communications.
>   2.  Confirm that "ssh srvgridslurm01 hostname" returns exactly
> "srvgridslurm01" (both checks are sketched below).
>
> Andy
> On 11/26/2020 12:21 PM, Steve Bland wrote:
>
> sinfo always reports the nodes as not responding:
> [root at srvgridslurm03 ~]# sinfo -R
> REASON               USER      TIMESTAMP           NODELIST
> Not responding       slurm     2020-11-26T09:12:58 SRVGRIDSLURM01
> Not responding       slurm     2020-11-26T08:27:58 SRVGRIDSLURM02
> Not responding       slurm     2020-11-26T10:00:14 srvgridslurm03
>
>
> By tailing the log for slurmctld, I can see when a node is recognized:
> Node srvgridslurm03 now responding
>
>
> By turning up the logging levels I can see communication between slurmctld
> and the nodes, and there does appear to be a response (one way to raise the
> level is sketched after the log excerpt below):
>
> [2020-11-26T12:05:14.333] debug3: Tree sending to SRVGRIDSLURM01
> [2020-11-26T12:05:14.333] debug2: Tree head got back 0 looking for 3
> [2020-11-26T12:05:14.333] debug3: Tree sending to SRVGRIDSLURM02
> [2020-11-26T12:05:14.333] debug3: Tree sending to srvgridslurm03
> [2020-11-26T12:05:14.335] debug2: Tree head got back 1
> [2020-11-26T12:05:14.335] debug2: Tree head got back 2
> [2020-11-26T12:05:14.336] debug2: Tree head got back 3
> [2020-11-26T12:05:14.338] debug2: node_did_resp SRVGRIDSLURM01
> [2020-11-26T12:05:14.338] debug2: node_did_resp SRVGRIDSLURM02
> [2020-11-26T12:05:14.338] debug2: node_did_resp srvgridslurm03
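>
> For reference, the extra detail above comes from raising the daemon debug
> level; for example, without restarting slurmctld:
>
>     # raise slurmctld verbosity at runtime; revert with "scontrol setdebug info"
>     scontrol setdebug debug3
>
> or by setting SlurmctldDebug=debug3 and SlurmdDebug=debug3 in slurm.conf
> and restarting the daemons.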
>
> What I do not understand is the disconnect: slurmctld seems to record the
> responses, but still flags every node as not responding. There are only
> three nodes right now, as this is a test environment: three CentOS 7
> systems.
>
> [root at SRVGRIDSLURM01 ~]# scontrol show node
> NodeName=SRVGRIDSLURM01 Arch=x86_64 CoresPerSocket=4
>    CPUAlloc=0 CPUTot=4 CPULoad=0.01
>    AvailableFeatures=(null)
>    ActiveFeatures=(null)
>    Gres=(null)
>    NodeAddr=SRVGRIDSLURM01 NodeHostName=SRVGRIDSLURM01 Version=20.11.0
>    OS=Linux 3.10.0-1160.2.2.el7.x86_64 #1 SMP Tue Oct 20 16:53:08 UTC 2020
>    RealMemory=7821 AllocMem=0 FreeMem=5211 Sockets=1 Boards=1
>    State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>    Partitions=debug
>    BootTime=2020-11-24T08:04:25 SlurmdStartTime=2020-11-26T11:38:25
>    CfgTRES=cpu=4,mem=7821M,billing=4
>    AllocTRES=
>    CapWatts=n/a
>    CurrentWatts=0 AveWatts=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>    Reason=Not responding [slurm at 2020-11-26T09:12:58]
>    Comment=(null)
>
> NodeName=SRVGRIDSLURM02 Arch=x86_64 CoresPerSocket=4
>    CPUAlloc=0 CPUTot=4 CPULoad=0.01
>    AvailableFeatures=(null)
>    ActiveFeatures=(null)
>    Gres=(null)
>    NodeAddr=SRVGRIDSLURM02 NodeHostName=SRVGRIDSLURM02 Version=20.11.0
>    OS=Linux 3.10.0-1160.2.2.el7.x86_64 #1 SMP Tue Oct 20 16:53:08 UTC 2020
>    RealMemory=7821 AllocMem=0 FreeMem=6900 Sockets=1 Boards=1
>    State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>    Partitions=debug
>    BootTime=2020-11-24T08:04:32 SlurmdStartTime=2020-11-26T10:31:08
>    CfgTRES=cpu=4,mem=7821M,billing=4
>    AllocTRES=
>    CapWatts=n/a
>    CurrentWatts=0 AveWatts=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>    Reason=Not responding [slurm at 2020-11-26T08:27:58]
>    Comment=(null)
>
> NodeName=srvgridslurm03 Arch=x86_64 CoresPerSocket=4
>    CPUAlloc=0 CPUTot=4 CPULoad=0.01
>    AvailableFeatures=(null)
>    ActiveFeatures=(null)
>    Gres=(null)
>    NodeAddr=srvgridslurm03 NodeHostName=srvgridslurm03 Version=20.11.0
>    OS=Linux 3.10.0-1160.2.2.el7.x86_64 #1 SMP Tue Oct 20 16:53:08 UTC 2020
>    RealMemory=7821 AllocMem=0 FreeMem=7170 Sockets=1 Boards=1
>    State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>    Partitions=debug
>    BootTime=2020-11-26T09:46:49 SlurmdStartTime=2020-11-26T11:55:23
>    CfgTRES=cpu=4,mem=7821M,billing=4
>    AllocTRES=
>    CapWatts=n/a
>    CurrentWatts=0 AveWatts=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>    Reason=Not responding [slurm at 2020-11-26T10:00:14]
>    Comment=(null)
>
> Any suggestions? Thanks
>
>
> ----------------------------------------------
>
> This e-mail and any attachments may contain information that is
> confidential to Ross Video.
>
> If you are not the intended recipient, please notify me immediately by
> replying to this message. Please also delete all copies. Thank you.
> ----------------------------------------------
>
> ------------------------------
>
> Message: 2
> Date: Thu, 26 Nov 2020 13:40:24 -0500
> From: Andy Riebs <andy at candooz.com>
> To: Steve Bland <sbland at rossvideo.com>, Slurm User Community List
>         <slurm-users at lists.schedmd.com>
> Subject: Re: [slurm-users] [EXTERNAL] Re: trying to diagnose a
>         connectivity issue between the slurmctld process and the slurmd
> nodes
> Message-ID: <cdd891a8-bcff-8cc7-6b40-5854a8095986 at candooz.com>
> Content-Type: text/plain; charset="windows-1252"; Format="flowed"
>
> One last shot on the firewall front, Steve -- does the control node have
> a firewall enabled? I've seen cases where that can cause the sporadic
> messaging failures that you seem to be seeing.
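>
> A quick way to check, assuming the stock ports (6817 for slurmctld, 6818
> for slurmd) and firewalld on CentOS 7:
>
>     # on the control node: is a firewall running?
>     systemctl is-active firewalld
>     # from the control node: can slurmd on a compute node be reached?
>     timeout 2 bash -c ': </dev/tcp/srvgridslurm01/6818' && echo "slurmd reachable"
>     # from a compute node: can slurmctld on the controller be reached?
>     # (replace CONTROLLER with your slurmctld host)
>     timeout 2 bash -c ': </dev/tcp/CONTROLLER/6817' && echo "slurmctld reachable"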
>
> Failing that, I'll defer to anyone with different ideas!
>
> Andy
>
> On 11/26/2020 1:01 PM, Steve Bland wrote:
> > [quoted text from Message 1 trimmed]
>
> End of slurm-users Digest, Vol 37, Issue 46
> *******************************************
>
-- 
Veronica Chaul
+5411 3581-4041

