<div dir="auto">Unsubscribe</div><div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Nov 26, 2020 at 3:40 PM, <<a href="mailto:slurm-users-request@lists.schedmd.com">slurm-users-request@lists.schedmd.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Send slurm-users mailing list submissions to<br>
        <a href="mailto:slurm-users@lists.schedmd.com" target="_blank">slurm-users@lists.schedmd.com</a><br>
<br>
To subscribe or unsubscribe via the World Wide Web, visit<br>
        <a href="https://lists.schedmd.com/cgi-bin/mailman/listinfo/slurm-users" rel="noreferrer" target="_blank">https://lists.schedmd.com/cgi-bin/mailman/listinfo/slurm-users</a><br>
or, via email, send a message with subject or body 'help' to<br>
        <a href="mailto:slurm-users-request@lists.schedmd.com" target="_blank">slurm-users-request@lists.schedmd.com</a><br>
<br>
You can reach the person managing the list at<br>
        <a href="mailto:slurm-users-owner@lists.schedmd.com" target="_blank">slurm-users-owner@lists.schedmd.com</a><br>
<br>
When replying, please edit your Subject line so it is more specific<br>
than "Re: Contents of slurm-users digest..."<br>
<br>
<br>
Today's Topics:<br>
<br>
   1. Re: [EXTERNAL] Re: trying to diagnose a connectivity issue<br>
      between the slurmctld process and the slurmd nodes (Steve Bland)<br>
   2. Re: [EXTERNAL] Re: trying to diagnose a connectivity issue<br>
      between the slurmctld process and the slurmd nodes (Andy Riebs)<br>
<br>
<br>
----------------------------------------------------------------------<br>
<br>
Message: 1<br>
Date: Thu, 26 Nov 2020 18:01:25 +0000<br>
From: Steve Bland <<a href="mailto:sbland@rossvideo.com" target="_blank">sbland@rossvideo.com</a>><br>
To: "<a href="mailto:andy@candooz.com" target="_blank">andy@candooz.com</a>" <<a href="mailto:andy@candooz.com" target="_blank">andy@candooz.com</a>>, Slurm User Community List<br>
        <<a href="mailto:slurm-users@lists.schedmd.com" target="_blank">slurm-users@lists.schedmd.com</a>><br>
Subject: Re: [slurm-users] [EXTERNAL] Re: trying to diagnose a<br>
        connectivity issue between the slurmctld process and the slurmd nodes<br>
Message-ID:<br>
        <<a href="mailto:YTXPR0101MB2302A3F22023838FB5745EA2CFF90@YTXPR0101MB2302.CANPRD01.PROD.OUTLOOK.COM" target="_blank">YTXPR0101MB2302A3F22023838FB5745EA2CFF90@YTXPR0101MB2302.CANPRD01.PROD.OUTLOOK.COM</a>><br>
<br>
Content-Type: text/plain; charset="us-ascii"<br>
<br>
Thanks Andy<br>
<br>
Firewall is off on all three systems. Also, if they could not communicate, I do not think 'scontrol show node' would return the data that it does. And the logs would not show responses, as indicated below.<br>
<br>
And the names are correct; I used the recommended 'hostname -s' when configuring the slurm.conf node entries.<br>
In fact, Slurm seems to be case-sensitive, which surprised the heck out of me.<br>
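The case-sensitivity point above can be checked mechanically: compare the node's short hostname, byte for byte, against the NodeName entries in slurm.conf. A minimal sketch — the conf fragment and hostname below are illustrative stand-ins, not the cluster's real configuration:

```shell
# Illustrative slurm.conf fragment (made up for this sketch).
cat > /tmp/slurm.conf.sample <<'EOF'
NodeName=SRVGRIDSLURM01 CPUs=4 State=UNKNOWN
NodeName=SRVGRIDSLURM02 CPUs=4 State=UNKNOWN
NodeName=srvgridslurm03 CPUs=4 State=UNKNOWN
EOF

# Stand-in for: myname=$(hostname -s)
myname="srvgridslurm01"

# grep is case-sensitive by default, like Slurm's name matching,
# so an uppercase conf entry will NOT match a lowercase hostname.
if grep -q "^NodeName=${myname} " /tmp/slurm.conf.sample; then
  echo "found exact match for $myname"
else
  echo "no exact (case-sensitive) match for $myname"
fi
```

With the fragment above this prints the mismatch branch, since the conf entry is `SRVGRIDSLURM01` but the hostname is lowercase.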
<br>
<br>
<br>
From: slurm-users <<a href="mailto:slurm-users-bounces@lists.schedmd.com" target="_blank">slurm-users-bounces@lists.schedmd.com</a>> On Behalf Of Andy Riebs<br>
Sent: Thursday, November 26, 2020 12:50<br>
To: <a href="mailto:slurm-users@lists.schedmd.com" target="_blank">slurm-users@lists.schedmd.com</a><br>
Subject: [EXTERNAL] Re: [slurm-users] trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes<br>
<br>
<br>
  1.  Look for a firewall on all of your Slurm nodes -- they almost always break Slurm communications.<br>
  2.  Confirm that "ssh srvgridslurm01 hostname" returns, exactly, "srvgridslurm01"<br>
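Check 2 above can be scripted. A minimal sketch in which the remote `ssh "$node" hostname -s` call is replaced by a plain parameter, so the case-sensitive comparison can be illustrated without cluster access:

```shell
# Compare the name slurm.conf expects against the name the node reports.
# In real use, "reported" would come from: reported=$(ssh "$expected" hostname -s)
check_node_name() {
  expected="$1"   # NodeName as written in slurm.conf
  reported="$2"   # short hostname the node itself reports
  if [ "$reported" = "$expected" ]; then
    echo "OK: $expected"
  else
    echo "FAIL: expected '$expected', node reports '$reported'"
  fi
}

check_node_name srvgridslurm03 srvgridslurm03
check_node_name srvgridslurm01 SRVGRIDSLURM01
```

The second call fails because shell string comparison, like Slurm's node-name matching, is case-sensitive.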
<br>
Andy<br>
On 11/26/2020 12:21 PM, Steve Bland wrote:<br>
<br>
sinfo always reports the nodes as not responding:<br>
[root@srvgridslurm03 ~]# sinfo -R<br>
REASON               USER      TIMESTAMP           NODELIST<br>
Not responding       slurm     2020-11-26T09:12:58 SRVGRIDSLURM01<br>
Not responding       slurm     2020-11-26T08:27:58 SRVGRIDSLURM02<br>
Not responding       slurm     2020-11-26T10:00:14 srvgridslurm03<br>
<br>
<br>
By tailing the slurmctld log, I can see when a node is recognized:<br>
Node srvgridslurm03 now responding<br>
<br>
<br>
By turning up the logging levels, I can see communication between slurmctld and the nodes, and there appears to be a response:<br>
<br>
[2020-11-26T12:05:14.333] debug3: Tree sending to SRVGRIDSLURM01<br>
[2020-11-26T12:05:14.333] debug2: Tree head got back 0 looking for 3<br>
[2020-11-26T12:05:14.333] debug3: Tree sending to SRVGRIDSLURM02<br>
[2020-11-26T12:05:14.333] debug3: Tree sending to srvgridslurm03<br>
[2020-11-26T12:05:14.335] debug2: Tree head got back 1<br>
[2020-11-26T12:05:14.335] debug2: Tree head got back 2<br>
[2020-11-26T12:05:14.336] debug2: Tree head got back 3<br>
[2020-11-26T12:05:14.338] debug2: node_did_resp SRVGRIDSLURM01<br>
[2020-11-26T12:05:14.338] debug2: node_did_resp SRVGRIDSLURM02<br>
[2020-11-26T12:05:14.338] debug2: node_did_resp srvgridslurm03<br>
<br>
What I do not understand is the disconnect: the logs record responses, yet every node is flagged as not responding. There are only three nodes right now, as this is a test environment: three CentOS 7 systems.<br>
<br>
[root@SRVGRIDSLURM01 ~]# scontrol show node<br>
NodeName=SRVGRIDSLURM01 Arch=x86_64 CoresPerSocket=4<br>
   CPUAlloc=0 CPUTot=4 CPULoad=0.01<br>
   AvailableFeatures=(null)<br>
   ActiveFeatures=(null)<br>
   Gres=(null)<br>
   NodeAddr=SRVGRIDSLURM01 NodeHostName=SRVGRIDSLURM01 Version=20.11.0<br>
   OS=Linux 3.10.0-1160.2.2.el7.x86_64 #1 SMP Tue Oct 20 16:53:08 UTC 2020<br>
   RealMemory=7821 AllocMem=0 FreeMem=5211 Sockets=1 Boards=1<br>
   State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A<br>
   Partitions=debug<br>
   BootTime=2020-11-24T08:04:25 SlurmdStartTime=2020-11-26T11:38:25<br>
   CfgTRES=cpu=4,mem=7821M,billing=4<br>
   AllocTRES=<br>
   CapWatts=n/a<br>
   CurrentWatts=0 AveWatts=0<br>
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s<br>
   Reason=Not responding [slurm@2020-11-26T09:12:58]<br>
   Comment=(null)<br>
<br>
NodeName=SRVGRIDSLURM02 Arch=x86_64 CoresPerSocket=4<br>
   CPUAlloc=0 CPUTot=4 CPULoad=0.01<br>
   AvailableFeatures=(null)<br>
   ActiveFeatures=(null)<br>
   Gres=(null)<br>
   NodeAddr=SRVGRIDSLURM02 NodeHostName=SRVGRIDSLURM02 Version=20.11.0<br>
   OS=Linux 3.10.0-1160.2.2.el7.x86_64 #1 SMP Tue Oct 20 16:53:08 UTC 2020<br>
   RealMemory=7821 AllocMem=0 FreeMem=6900 Sockets=1 Boards=1<br>
   State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A<br>
   Partitions=debug<br>
   BootTime=2020-11-24T08:04:32 SlurmdStartTime=2020-11-26T10:31:08<br>
   CfgTRES=cpu=4,mem=7821M,billing=4<br>
   AllocTRES=<br>
   CapWatts=n/a<br>
   CurrentWatts=0 AveWatts=0<br>
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s<br>
   Reason=Not responding [slurm@2020-11-26T08:27:58]<br>
   Comment=(null)<br>
<br>
NodeName=srvgridslurm03 Arch=x86_64 CoresPerSocket=4<br>
   CPUAlloc=0 CPUTot=4 CPULoad=0.01<br>
   AvailableFeatures=(null)<br>
   ActiveFeatures=(null)<br>
   Gres=(null)<br>
   NodeAddr=srvgridslurm03 NodeHostName=srvgridslurm03 Version=20.11.0<br>
   OS=Linux 3.10.0-1160.2.2.el7.x86_64 #1 SMP Tue Oct 20 16:53:08 UTC 2020<br>
   RealMemory=7821 AllocMem=0 FreeMem=7170 Sockets=1 Boards=1<br>
   State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A<br>
   Partitions=debug<br>
   BootTime=2020-11-26T09:46:49 SlurmdStartTime=2020-11-26T11:55:23<br>
   CfgTRES=cpu=4,mem=7821M,billing=4<br>
   AllocTRES=<br>
   CapWatts=n/a<br>
   CurrentWatts=0 AveWatts=0<br>
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s<br>
   Reason=Not responding [slurm@2020-11-26T10:00:14]<br>
   Comment=(null)<br>
<br>
Any suggestions? Thanks<br>
<br>
<br>
----------------------------------------------<br>
<br>
This e-mail and any attachments may contain information that is confidential to Ross Video.<br>
<br>
If you are not the intended recipient, please notify me immediately by replying to this message. Please also delete all copies. Thank you.<br>
----------------------------------------------<br>
<br>
------------------------------<br>
<br>
Message: 2<br>
Date: Thu, 26 Nov 2020 13:40:24 -0500<br>
From: Andy Riebs <<a href="mailto:andy@candooz.com" target="_blank">andy@candooz.com</a>><br>
To: Steve Bland <<a href="mailto:sbland@rossvideo.com" target="_blank">sbland@rossvideo.com</a>>, Slurm User Community List<br>
        <<a href="mailto:slurm-users@lists.schedmd.com" target="_blank">slurm-users@lists.schedmd.com</a>><br>
Subject: Re: [slurm-users] [EXTERNAL] Re: trying to diagnose a<br>
        connectivity issue between the slurmctld process and the slurmd nodes<br>
Message-ID: <<a href="mailto:cdd891a8-bcff-8cc7-6b40-5854a8095986@candooz.com" target="_blank">cdd891a8-bcff-8cc7-6b40-5854a8095986@candooz.com</a>><br>
Content-Type: text/plain; charset="windows-1252"; Format="flowed"<br>
<br>
One last shot on the firewall front Steve -- does the control node have <br>
a firewall enabled? I've seen cases where that can cause the sporadic <br>
messaging failures that you seem to be seeing.<br>
<br>
That failing, I'll defer to anyone with different ideas!<br>
<br>
Andy<br>
<br>
On 11/26/2020 1:01 PM, Steve Bland wrote:<br>
<br>
End of slurm-users Digest, Vol 37, Issue 46<br>
*******************************************<br>
</blockquote></div></div>-- <br><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div>Veronica Chaul</div>+5411 3581-4041</div></div>