<html>
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252">
</head>
<body>
<ol>
<li>Look for a firewall on all of your Slurm nodes. Firewalls almost always
break Slurm communications.<br>
</li>
<li>Confirm that "ssh srvgridslurm01 hostname" returns, exactly,
"srvgridslurm01"</li>
</ol>
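<p>A minimal sketch of both checks, assuming CentOS 7 with firewalld and the
default Slurm ports (6817 for slurmctld, 6818 for slurmd). The
<code>check_hostname</code> helper is hypothetical, not part of Slurm. It
compares names case-sensitively, since "SRVGRIDSLURM01" and "srvgridslurm01"
are different strings.</p>

```shell
# Hypothetical helper: succeed only if ACTUAL matches EXPECTED exactly
# (case-sensitive string comparison).
check_hostname() {
    expected="$1"
    actual="$2"
    if [ "$actual" = "$expected" ]; then
        echo "OK: $expected"
    else
        echo "MISMATCH: expected '$expected', got '$actual'"
        return 1
    fi
}

# Assumed usage from the controller (node names taken from this thread):
# for node in srvgridslurm01 srvgridslurm02 srvgridslurm03; do
#     check_hostname "$node" "$(ssh "$node" hostname)"
#     ssh "$node" firewall-cmd --state   # prints "running" if firewalld is active
# done
```

<p>If firewalld turns out to be running, either stop it on the test nodes or
open the Slurm ports between them.</p>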
<p>Andy<br>
</p>
<div class="moz-cite-prefix">On 11/26/2020 12:21 PM, Steve Bland
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:YTXPR0101MB23024DFAA2C0D1ABF7946A2ECFF90@YTXPR0101MB2302.CANPRD01.PROD.OUTLOOK.COM">
<div class="WordSection1">
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">sinfo always reports every node as not responding:<o:p></o:p></p>
<p class="MsoNormal">[root@srvgridslurm03 ~]# sinfo -R<o:p></o:p></p>
<p class="MsoNormal">REASON USER
TIMESTAMP NODELIST<o:p></o:p></p>
<p class="MsoNormal">Not responding slurm
2020-11-26T09:12:58 SRVGRIDSLURM01<o:p></o:p></p>
<p class="MsoNormal">Not responding slurm
2020-11-26T08:27:58 SRVGRIDSLURM02<o:p></o:p></p>
<p class="MsoNormal">Not responding slurm
2020-11-26T10:00:14 srvgridslurm03<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">By tailing the log for slurmctld, I can
see when a node is recognized<o:p></o:p></p>
<p class="MsoNormal">Node srvgridslurm03 now responding<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">By turning up the logging levels I can see
communication between slurmctld and the nodes, and there appears to be a
response:<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">[2020-11-26T12:05:14.333] debug3: Tree
sending to SRVGRIDSLURM01<o:p></o:p></p>
<p class="MsoNormal">[2020-11-26T12:05:14.333] debug2: Tree head
got back 0 looking for 3<o:p></o:p></p>
<p class="MsoNormal">[2020-11-26T12:05:14.333] debug3: Tree
sending to SRVGRIDSLURM02<o:p></o:p></p>
<p class="MsoNormal">[2020-11-26T12:05:14.333] debug3: Tree
sending to srvgridslurm03<o:p></o:p></p>
<p class="MsoNormal">[2020-11-26T12:05:14.335] debug2: Tree head
got back 1<o:p></o:p></p>
<p class="MsoNormal">[2020-11-26T12:05:14.335] debug2: Tree head
got back 2<o:p></o:p></p>
<p class="MsoNormal">[2020-11-26T12:05:14.336] debug2: Tree head
got back 3<o:p></o:p></p>
<p class="MsoNormal">[2020-11-26T12:05:14.338] debug2:
node_did_resp SRVGRIDSLURM01<o:p></o:p></p>
<p class="MsoNormal">[2020-11-26T12:05:14.338] debug2:
node_did_resp SRVGRIDSLURM02<o:p></o:p></p>
<p class="MsoNormal">[2020-11-26T12:05:14.338] debug2:
node_did_resp srvgridslurm03<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">What I do not understand is the disconnect.
slurmctld seems to record responses, but still flags every node as not
responding. There are only three nodes right now, as this
is a test environment: three CentOS 7 systems.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">[root@SRVGRIDSLURM01 ~]# scontrol show node<o:p></o:p></p>
<p class="MsoNormal">NodeName=SRVGRIDSLURM01 Arch=x86_64
CoresPerSocket=4<o:p></o:p></p>
<p class="MsoNormal"> CPUAlloc=0 CPUTot=4 CPULoad=0.01<o:p></o:p></p>
<p class="MsoNormal"> AvailableFeatures=(null)<o:p></o:p></p>
<p class="MsoNormal"> ActiveFeatures=(null)<o:p></o:p></p>
<p class="MsoNormal"> Gres=(null)<o:p></o:p></p>
<p class="MsoNormal"> NodeAddr=SRVGRIDSLURM01
NodeHostName=SRVGRIDSLURM01 Version=20.11.0<o:p></o:p></p>
<p class="MsoNormal"> OS=Linux 3.10.0-1160.2.2.el7.x86_64 #1
SMP Tue Oct 20 16:53:08 UTC 2020<o:p></o:p></p>
<p class="MsoNormal"> RealMemory=7821 AllocMem=0 FreeMem=5211
Sockets=1 Boards=1<o:p></o:p></p>
<p class="MsoNormal"> State=DOWN ThreadsPerCore=1 TmpDisk=0
Weight=1 Owner=N/A MCS_label=N/A<o:p></o:p></p>
<p class="MsoNormal"> Partitions=debug<o:p></o:p></p>
<p class="MsoNormal"> BootTime=2020-11-24T08:04:25
SlurmdStartTime=2020-11-26T11:38:25<o:p></o:p></p>
<p class="MsoNormal"> CfgTRES=cpu=4,mem=7821M,billing=4<o:p></o:p></p>
<p class="MsoNormal"> AllocTRES=<o:p></o:p></p>
<p class="MsoNormal"> CapWatts=n/a<o:p></o:p></p>
<p class="MsoNormal"> CurrentWatts=0 AveWatts=0<o:p></o:p></p>
<p class="MsoNormal"> ExtSensorsJoules=n/s ExtSensorsWatts=0
ExtSensorsTemp=n/s<o:p></o:p></p>
<p class="MsoNormal"> Reason=Not responding
[slurm@2020-11-26T09:12:58]<o:p></o:p></p>
<p class="MsoNormal"> Comment=(null)<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">NodeName=SRVGRIDSLURM02 Arch=x86_64
CoresPerSocket=4<o:p></o:p></p>
<p class="MsoNormal"> CPUAlloc=0 CPUTot=4 CPULoad=0.01<o:p></o:p></p>
<p class="MsoNormal"> AvailableFeatures=(null)<o:p></o:p></p>
<p class="MsoNormal"> ActiveFeatures=(null)<o:p></o:p></p>
<p class="MsoNormal"> Gres=(null)<o:p></o:p></p>
<p class="MsoNormal"> NodeAddr=SRVGRIDSLURM02
NodeHostName=SRVGRIDSLURM02 Version=20.11.0<o:p></o:p></p>
<p class="MsoNormal"> OS=Linux 3.10.0-1160.2.2.el7.x86_64 #1
SMP Tue Oct 20 16:53:08 UTC 2020<o:p></o:p></p>
<p class="MsoNormal"> RealMemory=7821 AllocMem=0 FreeMem=6900
Sockets=1 Boards=1<o:p></o:p></p>
<p class="MsoNormal"> State=DOWN ThreadsPerCore=1 TmpDisk=0
Weight=1 Owner=N/A MCS_label=N/A<o:p></o:p></p>
<p class="MsoNormal"> Partitions=debug<o:p></o:p></p>
<p class="MsoNormal"> BootTime=2020-11-24T08:04:32
SlurmdStartTime=2020-11-26T10:31:08<o:p></o:p></p>
<p class="MsoNormal"> CfgTRES=cpu=4,mem=7821M,billing=4<o:p></o:p></p>
<p class="MsoNormal"> AllocTRES=<o:p></o:p></p>
<p class="MsoNormal"> CapWatts=n/a<o:p></o:p></p>
<p class="MsoNormal"> CurrentWatts=0 AveWatts=0<o:p></o:p></p>
<p class="MsoNormal"> ExtSensorsJoules=n/s ExtSensorsWatts=0
ExtSensorsTemp=n/s<o:p></o:p></p>
<p class="MsoNormal"> Reason=Not responding
[slurm@2020-11-26T08:27:58]<o:p></o:p></p>
<p class="MsoNormal"> Comment=(null)<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">NodeName=srvgridslurm03 Arch=x86_64
CoresPerSocket=4<o:p></o:p></p>
<p class="MsoNormal"> CPUAlloc=0 CPUTot=4 CPULoad=0.01<o:p></o:p></p>
<p class="MsoNormal"> AvailableFeatures=(null)<o:p></o:p></p>
<p class="MsoNormal"> ActiveFeatures=(null)<o:p></o:p></p>
<p class="MsoNormal"> Gres=(null)<o:p></o:p></p>
<p class="MsoNormal"> NodeAddr=srvgridslurm03
NodeHostName=srvgridslurm03 Version=20.11.0<o:p></o:p></p>
<p class="MsoNormal"> OS=Linux 3.10.0-1160.2.2.el7.x86_64 #1
SMP Tue Oct 20 16:53:08 UTC 2020<o:p></o:p></p>
<p class="MsoNormal"> RealMemory=7821 AllocMem=0 FreeMem=7170
Sockets=1 Boards=1<o:p></o:p></p>
<p class="MsoNormal"> State=DOWN ThreadsPerCore=1 TmpDisk=0
Weight=1 Owner=N/A MCS_label=N/A<o:p></o:p></p>
<p class="MsoNormal"> Partitions=debug<o:p></o:p></p>
<p class="MsoNormal"> BootTime=2020-11-26T09:46:49
SlurmdStartTime=2020-11-26T11:55:23<o:p></o:p></p>
<p class="MsoNormal"> CfgTRES=cpu=4,mem=7821M,billing=4<o:p></o:p></p>
<p class="MsoNormal"> AllocTRES=<o:p></o:p></p>
<p class="MsoNormal"> CapWatts=n/a<o:p></o:p></p>
<p class="MsoNormal"> CurrentWatts=0 AveWatts=0<o:p></o:p></p>
<p class="MsoNormal"> ExtSensorsJoules=n/s ExtSensorsWatts=0
ExtSensorsTemp=n/s<o:p></o:p></p>
<p class="MsoNormal"> Reason=Not responding
[slurm@2020-11-26T10:00:14]<o:p></o:p></p>
<p class="MsoNormal"> Comment=(null)<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Any suggestions? Thanks<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
</blockquote>
</body>
</html>