[slurm-users] [EXTERNAL] Re: trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes
Steve Bland
sbland at rossvideo.com
Mon Nov 30 15:37:49 UTC 2020
Although, in testing, even with ReturnToService set to 1, after a restart the logs show that the node has come back, but it is still classified as down, so it will not take jobs until manually told otherwise:
[2020-11-30T10:33:05.402] debug2: node_did_resp SRVGRIDSLURM01
[2020-11-30T10:33:05.402] debug2: node_did_resp srvgridslurm03
[2020-11-30T10:33:05.402] debug2: node_did_resp SRVGRIDSLURM02
There has to be a way around this manual intervention.
thanks
From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Steve Bland
Sent: Monday, November 30, 2020 08:12
To: slurm-users at lists.schedmd.com
Subject: Re: [slurm-users] [EXTERNAL] Re: trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes
Thanks Chris
When I did that, they all came back.
I also found that ReturnToService was set to 0 in slurm.conf, so I have changed that for now. I may turn it back to 0 to see if any nodes are lost, but I assume that will be in the log.
Interestingly, I had this in slurm.conf and thought it would make the initial state up for all nodes:
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
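For reference, a hedged sketch of how these two settings relate, based on the slurm.conf man page rather than on anything tested on this cluster. The value 2 is only a suggestion to try, on the assumption that the nodes were marked down for something other than being non-responsive (an unexpected reboot, for example):

```
# slurm.conf (fragment, illustrative only)
# ReturnToService decides when a DOWN node may return to service by itself:
#   0  the node stays DOWN until an administrator changes its state
#   1  the node returns only if it was set DOWN for being non-responsive
#   2  the node returns once it registers with a valid configuration,
#      whatever the reason it was set DOWN
ReturnToService=2

# State=UP on a PartitionName line is the state of the partition,
# not of the nodes that belong to it.
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
```

slurmctld has to pick up the change (for instance with "scontrol reconfigure") before it applies.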
Steve Bland
Technical Product Manager
Third Party Products
Ross Video | Production Technology Experts
T: +1 (613) 228-0688 ext.4219
www.rossvideo.com
________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Chris Samuel <chris at csamuel.org>
Sent: 27 November 2020 15:02
To: slurm-users at lists.schedmd.com
Subject: [EXTERNAL] Re: [slurm-users] trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes
On 26/11/20 9:21 am, Steve Bland wrote:
> Sinfo always returns nodes not responding
One thing - do the nodes return to this state when you resume them with
"scontrol update node=srvgridslurm[01-03] state=resume"?
If they do, then what do your slurmctld logs say is the reason for this?
You can bump up the log level on your slurmctld with (for instance)
"scontrol setdebug debug" for more info (we run ours at debug all the
time anyway).
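The checks above could be strung together roughly as follows. This is a sketch, not a tested procedure: the node names are the ones from this thread, and NODES, DRY_RUN, run and diagnose_nodes are hypothetical names made up for the example. By default it only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch only. NODES defaults to the node names used in this thread;
# DRY_RUN is a hypothetical switch: 1 = print commands, 0 = also run them.
NODES="${NODES:-srvgridslurm[01-03]}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Always show the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

diagnose_nodes() {
    run sinfo -R                       # down/drained nodes with the recorded reason
    run scontrol show node "$NODES"    # full node records, including State and Reason
    run scontrol setdebug debug        # raise the slurmctld log level without a restart
    run scontrol update nodename="$NODES" state=resume
}

diagnose_nodes
```

Run it with DRY_RUN=0 on the controller host to actually execute the commands, then watch the slurmctld log to see whether the nodes drop back to down and why.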
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
----------------------------------------------
This e-mail and any attachments may contain information that is confidential to Ross Video.
If you are not the intended recipient, please notify me immediately by replying to this message. Please also delete all copies. Thank you.