[slurm-users] trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes

Andy Riebs andy at candooz.com
Thu Nov 26 17:50:08 UTC 2020


 1. Look for a firewall on all of your Slurm nodes; firewalls almost
    always break Slurm communications.
 2. Confirm that "ssh srvgridslurm01 hostname" returns, exactly,
    "srvgridslurm01"
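
The check in point 2 can be sketched as a small shell helper. This is only a sketch under assumptions: the function name compare_node_name is made up for illustration, and it assumes that the name a node reports must match the configured node name exactly, so that a difference in letter case alone is worth flagging.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): compare the node name used in
# the configuration with the name the node itself reports.
# Prints "match", "case-mismatch", or "different".
# Returns 0 only on an exact match.
compare_node_name() {
    configured=$1
    reported=$2

    # Exact, case-sensitive comparison first.
    if [ "$configured" = "$reported" ]; then
        echo "match"
        return 0
    fi

    # Fold both names to lower case to detect a case-only difference.
    lc_configured=$(printf '%s' "$configured" | tr '[:upper:]' '[:lower:]')
    lc_reported=$(printf '%s' "$reported" | tr '[:upper:]' '[:lower:]')

    if [ "$lc_configured" = "$lc_reported" ]; then
        echo "case-mismatch"
    else
        echo "different"
    fi
    return 1
}

# Possible usage from the controller, assuming passwordless ssh works
# and using the node names exactly as they appear in the sinfo output:
#
#   for n in SRVGRIDSLURM01 SRVGRIDSLURM02 srvgridslurm03; do
#       printf '%s: ' "$n"
#       compare_node_name "$n" "$(ssh "$n" hostname)"
#   done
```

Anything other than "match" for a node would be worth looking into before digging further into the logs.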

Andy

On 11/26/2020 12:21 PM, Steve Bland wrote:
>
> sinfo always reports the nodes as not responding:
>
> [root@srvgridslurm03 ~]# sinfo -R
> REASON               USER      TIMESTAMP           NODELIST
> Not responding       slurm     2020-11-26T09:12:58 SRVGRIDSLURM01
> Not responding       slurm     2020-11-26T08:27:58 SRVGRIDSLURM02
> Not responding       slurm     2020-11-26T10:00:14 srvgridslurm03
>
> By tailing the slurmctld log, I can see when a node is recognized:
>
> Node srvgridslurm03 now responding
>
> By turning up the logging level, I can see communication between
> slurmctld and the nodes, and there appears to be a response:
>
> [2020-11-26T12:05:14.333] debug3: Tree sending to SRVGRIDSLURM01
> [2020-11-26T12:05:14.333] debug2: Tree head got back 0 looking for 3
> [2020-11-26T12:05:14.333] debug3: Tree sending to SRVGRIDSLURM02
> [2020-11-26T12:05:14.333] debug3: Tree sending to srvgridslurm03
> [2020-11-26T12:05:14.335] debug2: Tree head got back 1
> [2020-11-26T12:05:14.335] debug2: Tree head got back 2
> [2020-11-26T12:05:14.336] debug2: Tree head got back 3
> [2020-11-26T12:05:14.338] debug2: node_did_resp SRVGRIDSLURM01
> [2020-11-26T12:05:14.338] debug2: node_did_resp SRVGRIDSLURM02
> [2020-11-26T12:05:14.338] debug2: node_did_resp srvgridslurm03
>
> What I do not understand is the disconnect. slurmctld seems to record
> the responses, but it still flags every node as not responding. There
> are only three nodes right now, as this is a test environment: three
> CentOS 7 systems.
>
> [root@SRVGRIDSLURM01 ~]# scontrol show node
> NodeName=SRVGRIDSLURM01 Arch=x86_64 CoresPerSocket=4
>    CPUAlloc=0 CPUTot=4 CPULoad=0.01
>    AvailableFeatures=(null)
>    ActiveFeatures=(null)
>    Gres=(null)
>    NodeAddr=SRVGRIDSLURM01 NodeHostName=SRVGRIDSLURM01 Version=20.11.0
>    OS=Linux 3.10.0-1160.2.2.el7.x86_64 #1 SMP Tue Oct 20 16:53:08 UTC 2020
>    RealMemory=7821 AllocMem=0 FreeMem=5211 Sockets=1 Boards=1
>    State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>    Partitions=debug
>    BootTime=2020-11-24T08:04:25 SlurmdStartTime=2020-11-26T11:38:25
>    CfgTRES=cpu=4,mem=7821M,billing=4
>    AllocTRES=
>    CapWatts=n/a
>    CurrentWatts=0 AveWatts=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>    Reason=Not responding [slurm@2020-11-26T09:12:58]
>    Comment=(null)
>
> NodeName=SRVGRIDSLURM02 Arch=x86_64 CoresPerSocket=4
>    CPUAlloc=0 CPUTot=4 CPULoad=0.01
>    AvailableFeatures=(null)
>    ActiveFeatures=(null)
>    Gres=(null)
>    NodeAddr=SRVGRIDSLURM02 NodeHostName=SRVGRIDSLURM02 Version=20.11.0
>    OS=Linux 3.10.0-1160.2.2.el7.x86_64 #1 SMP Tue Oct 20 16:53:08 UTC 2020
>    RealMemory=7821 AllocMem=0 FreeMem=6900 Sockets=1 Boards=1
>    State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>    Partitions=debug
>    BootTime=2020-11-24T08:04:32 SlurmdStartTime=2020-11-26T10:31:08
>    CfgTRES=cpu=4,mem=7821M,billing=4
>    AllocTRES=
>    CapWatts=n/a
>    CurrentWatts=0 AveWatts=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>    Reason=Not responding [slurm@2020-11-26T08:27:58]
>    Comment=(null)
>
> NodeName=srvgridslurm03 Arch=x86_64 CoresPerSocket=4
>    CPUAlloc=0 CPUTot=4 CPULoad=0.01
>    AvailableFeatures=(null)
>    ActiveFeatures=(null)
>    Gres=(null)
>    NodeAddr=srvgridslurm03 NodeHostName=srvgridslurm03 Version=20.11.0
>    OS=Linux 3.10.0-1160.2.2.el7.x86_64 #1 SMP Tue Oct 20 16:53:08 UTC 2020
>    RealMemory=7821 AllocMem=0 FreeMem=7170 Sockets=1 Boards=1
>    State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>    Partitions=debug
>    BootTime=2020-11-26T09:46:49 SlurmdStartTime=2020-11-26T11:55:23
>    CfgTRES=cpu=4,mem=7821M,billing=4
>    AllocTRES=
>    CapWatts=n/a
>    CurrentWatts=0 AveWatts=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>    Reason=Not responding [slurm@2020-11-26T10:00:14]
>    Comment=(null)
>
> Any suggestions? Thanks