Look at the slurmd logs on these nodes, or try running slurmd in the foreground (non-daemon mode).
And, as I said in another thread, check the time on these nodes.
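A rough sketch of both checks (assuming systemd-managed nodes with chrony; <controller> is a placeholder for your slurmctld host):

  systemctl stop slurmd          # stop the unit first
  slurmd -D -vvv                 # run slurmd in the foreground with verbose logging

  date; ssh <controller> date    # compare the node clock with the controller's
  chronyc tracking               # check NTP sync status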
On Tue, Sep 23, 2025, 11:41 PM Julien Tailleur via slurm-users <slurm-users@lists.schedmd.com> wrote:
On 9/23/25 16:44, Davide DelVento wrote:
As the great Ole just taught us in another thread, this should tell you why:
sacctmgr show event Format=NodeName,TimeStart,Duration,State%-6,Reason%-40,User where nodes=FX[12-14]
However, I suspect you'd only get "not responding" again ;-)
Good prediction!
sacctmgr show event Format=NodeName,TimeStart,Duration,State%-6,Reason%-40,User where nodes=FX[12-14]
NodeName   TimeStart            Duration       State   Reason                    User
           2021-08-25T11:13:56  1490-12:21:12          Cluster Registered TRES
FX12       2025-09-08T15:04:39  15-08:30:29    DOWN*   Not responding            slurm(640+
FX13       2025-09-08T15:04:39  15-08:30:29    DOWN*   Not responding            slurm(640+
FX14       2025-09-08T15:04:39  15-08:30:29    DOWN*   Not responding            slurm(640+
Are you sure that all the slurm services are running correctly on those servers? Maybe try rebooting them?
The services were all running. "Correctly" is harder to say :-) I did not see anything obviously interesting in the logs, but I am not sure what to look for.
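A few commands that might show more than "the unit is active" (a sketch, assuming systemd units and that scontrol is available on the compute nodes):

  systemctl status slurmd                    # unit state plus the last few log lines
  journalctl -u slurmd --since "1 hour ago"  # recent slurmd messages
  scontrol ping                              # can this node reach slurmctld?
  scontrol show node FX12 | grep -i reason   # what the controller recorded for the node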
Anyway, I've followed your advice and rebooted the servers, and they are idle for now. I will see how long it lasts. If that fixed it, I will fall on my sword and apologize for disturbing the ML...
Best,
Julien