On 9/23/25 16:44, Davide DelVento wrote:
As the great Ole just taught us in another thread, this should tell you why:
sacctmgr show event Format=NodeName,TimeStart,Duration,State%-6,Reason%-40,User where nodes=FX[12-14]
However I suspect you'd only get "not responding" again ;-)
Good prediction!
sacctmgr show event Format=NodeName,TimeStart,Duration,State%-6,Reason%-40,User
       NodeName           TimeStart      Duration State  Reason                                         User
--------------- ------------------- ------------- ------ ---------------------------------------- ----------
                2021-08-25T11:13:56 1490-12:21:12        Cluster Registered TRES
           FX12 2025-09-08T15:04:39   15-08:30:29 DOWN*  Not responding                           slurm(640+
           FX13 2025-09-08T15:04:39   15-08:30:29 DOWN*  Not responding                           slurm(640+
           FX14 2025-09-08T15:04:39   15-08:30:29 DOWN*  Not responding                           slurm(640+
Are you sure that all the slurm services are running correctly on those servers? Maybe try rebooting them?
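In case it helps, these are the checks I'd run (assuming systemd-managed daemons; adjust unit and node names to your setup):

# On an affected node: is slurmd actually running?
systemctl status slurmd
# Still on the node: can it reach slurmctld at all?
scontrol ping
# On the controller: full node state, including the Reason field
scontrol show node FX12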
The services were all running. "Correctly" is harder to say :-) I did not see anything obviously interesting in the logs, but I am not sure what to look for.
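For the record, these are the places I looked (log paths are whatever SlurmdLogFile/SlurmctldLogFile point to in our slurm.conf, so yours may differ):

# Service-level messages on the affected nodes
journalctl -u slurmd --since "2025-09-08"
# On the controller, anything mentioning the nodes around the time they went DOWN
grep -i "FX1[234]" /var/log/slurm/slurmctld.log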
Anyway, I followed your advice and rebooted the servers; they are idle for now. I will see how long that lasts. If that fixed it, I will fall on my sword and apologize for disturbing the ML...
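One more note for the archives: if the nodes had stayed DOWN after the reboot, my understanding is they would have needed an explicit resume from the controller, since with ReturnToService=0 (the default) a DOWN node does not return to service on its own:

# Return the nodes to service once slurmd is confirmed healthy
scontrol update NodeName=FX[12-14] State=RESUME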
Best,
Julien