[slurm-users] [EXT] Re: [EXTERNAL] Re: trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes

Sean Crosby scrosby at unimelb.edu.au
Mon Nov 30 16:15:07 UTC 2020


You showed that firewalld is off, but on CentOS 7/RHEL 7 that doesn't
prove there is no firewall; iptables rules can be loaded without firewalld.

What is the output of

iptables -S
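
If it helps, here is a rough sketch for reading that output. With no rules
loaded, iptables -S on the filter table prints only the three default
policies (-P INPUT/FORWARD/OUTPUT ACCEPT). The helper name has_filter_rules
is made up for illustration, and it only looks at whatever table you pipe
into it (iptables -S shows the filter table unless you pass -t).

```shell
# Hypothetical helper: reads `iptables -S` output on stdin.
# Returns 0 if there is any appended rule (-A ...) or a default
# policy other than ACCEPT, and 1 if the table looks empty.
has_filter_rules() {
    awk '
        /^-A /                   { found = 1 }
        /^-P / && $3 != "ACCEPT" { found = 1 }
        END                      { exit (found ? 0 : 1) }
    '
}

# Usage on a node, as root (not run here):
#   iptables -S | has_filter_rules && echo "iptables rules are present"
```

A positive result doesn't mean the slurmd port is blocked, only that there
are rules worth reading through.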

I'd also try doing

# scontrol show config | grep -i SlurmdPort
SlurmdPort              = 6818

Then, using whatever port is shown, try connecting from each compute node
to the slurmd daemons on the other nodes.

e.g. from SRVGRIDSLURM01 do

nc -z SRVGRIDSLURM02 6818 || echo Cannot communicate
nc -z srvgridslurm03 6818 || echo Cannot communicate

Replace 6818 with the port reported by the scontrol show config command
above.
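
The two steps can be strung together so the port isn't typed by hand. This
is only a sketch: the function names slurmd_port and check_nodes are made
up here, and the -w 3 timeout is my addition, assuming your nc accepts -w
the way ncat and OpenBSD netcat do.

```shell
# Hypothetical helper: pull the SlurmdPort value out of
# `scontrol show config` output read on stdin.
slurmd_port() {
    awk -F= 'tolower($1) ~ /^slurmdport[ \t]*$/ {
        gsub(/[ \t]/, "", $2)   # strip the padding around the value
        print $2
        exit
    }'
}

# Hypothetical helper: probe each host on the given port and
# report the ones that cannot be reached.
check_nodes() {
    port=$1
    shift
    for host in "$@"; do
        nc -z -w 3 "$host" "$port" || echo "Cannot communicate with $host:$port"
    done
}

# Usage from SRVGRIDSLURM01 (not run here):
#   port=$(scontrol show config | slurmd_port)
#   check_nodes "$port" SRVGRIDSLURM02 srvgridslurm03
```

Running the same thing from each node in turn shows whether the failure is
one-directional.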

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia



On Tue, 1 Dec 2020 at 02:37, Steve Bland <sbland at rossvideo.com> wrote:

>
> Although, in testing, even with ReturnToService set to ‘1’, after a restart
> the logs show that the node has come back, but it is still classified as
> down, so it will not take jobs until manually told otherwise.
>
>
>
>
>
> [2020-11-30T10:33:05.402] debug2: node_did_resp SRVGRIDSLURM01
>
> [2020-11-30T10:33:05.402] debug2: node_did_resp srvgridslurm03
>
> [2020-11-30T10:33:05.402] debug2: node_did_resp SRVGRIDSLURM02
>
>
>
> There has to be a way around this manual intervention
>
>
>
> thanks
>
>
>
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> *On Behalf Of*
> Steve Bland
> *Sent:* Monday, November 30, 2020 08:12
> *To:* slurm-users at lists.schedmd.com
> *Subject:* Re: [slurm-users] [EXTERNAL] Re: trying to diagnose a
> connectivity issue between the slurmctld process and the slurmd nodes
>
>
>
> Thanks Chris
>
>
>
> When I did that, they all came back.
>
>
>
> Also found that in slurm.conf, ReturnToService was set to 0, so I
> modified that for now. I may turn it back to 0 to see if any nodes are lost,
> but I assume that will show up in the log.
>
>
>
> Interestingly, I had this in slurm.conf and thought it would make the
> initial state up for all nodes:
>
>
>
> PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>
>
>
>
>
> *Steve Bland*
> *Technical Product Manager*
>
> *Third Party Products*
> Ross Video | Production Technology Experts
> T: +1 (613) 228-0688 ext.4219
> www.rossvideo.com
> ------------------------------
>
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
> Chris Samuel <chris at csamuel.org>
> *Sent:* 27 November 2020 15:02
> *To:* slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
> *Subject:* [EXTERNAL] Re: [slurm-users] trying to diagnose a connectivity
> issue between the slurmctld process and the slurmd nodes
>
>
>
> On 26/11/20 9:21 am, Steve Bland wrote:
>
> > Sinfo always returns nodes not responding
>
> One thing - do the nodes return to this state when you resume them with
> "scontrol update node=srvgridslurm[01-03] state=resume" ?
>
> If they do then what do your slurmctld logs say for the reason for this?
>
> You can bump up the log level on your slurmctld with, for instance,
> "scontrol setdebug debug" for more info (we run ours at debug all the
> time anyway).
>
> All the best,
> Chris
> --
> Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
>
> ----------------------------------------------
>
> This e-mail and any attachments may contain information that is
> confidential to Ross Video.
>
> If you are not the intended recipient, please notify me immediately by
> replying to this message. Please also delete all copies. Thank you.
> ----------------------------------------------
>
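
On the ReturnToService point in the quoted message: as far as I read the
slurm.conf man page, a value of 1 only returns a node to service if it was
set DOWN for being non-responsive, while 2 returns it after a valid
registration whatever the reason was. So it is worth checking why the nodes
were marked DOWN. A small sketch, where the helper name down_reason is made
up for illustration:

```shell
# Hypothetical helper: print the Reason field from
# `scontrol show node <name>` output read on stdin.
down_reason() {
    sed -n 's/^[[:space:]]*Reason=//p'
}

# Usage on the controller (not run here):
#   scontrol show node SRVGRIDSLURM01 | down_reason
#   scontrol show config | grep -i ReturnToService
```

sinfo -R gives the same reasons for all down nodes at once. Whether
ReturnToService=2 is the right setting for your site is a separate
question.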