Dear all,
I maintain a small computing cluster and I am seeing a weird behavior that I have failed to debug.
My cluster comprises one master node and 16 compute servers, organized into two queues of 8 servers each. All servers run up-to-date Debian bullseye. All but 3 of them work flawlessly.
From the master node, I can see that 3 servers in one of the queues appear down:
jtailleu@kandinsky:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
Volume*      up   infinite      8  alloc FX[21-24,31-34]
Speed        up   infinite      3  down* FX[12-14]
Speed        up   infinite      4  alloc FX[41-44]
Speed        up   infinite      1   idle FX11
These servers are reachable via SSH and ping:
jtailleu@kandinsky:~$ ping -c 1 FX12
PING FX12 (192.168.6.22) 56(84) bytes of data.
64 bytes from FX12 (192.168.6.22): icmp_seq=1 ttl=64 time=0.070 ms

--- FX12 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.070/0.070/0.070/0.000 ms
I can also put these nodes back into idle mode:
root@kandinsky:~# scontrol update nodename=FX[12-14] state=idle
root@kandinsky:~# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
Volume*      up   infinite      8  alloc FX[21-24,31-34]
Speed        up   infinite      3  idle* FX[12-14]
Speed        up   infinite      4  alloc FX[41-44]
Speed        up   infinite      1   idle FX11
But then they switch back to down a few minutes later:
root@kandinsky:~# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
Volume*      up   infinite      8  alloc FX[21-24,31-34]
Speed        up   infinite      3  down* FX[12-14]
Speed        up   infinite      4  alloc FX[41-44]
Speed        up   infinite      1   idle FX11

root@kandinsky:~# sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Not responding       slurm     2025-09-08T15:04:39 FX[12-14]
I do not understand where the "Not responding" comes from, nor how to investigate it. Any idea what could trigger this behavior?
Best wishes,
Julien
As the great Ole just taught us in another thread, this should tell you why:
sacctmgr show event Format=NodeName,TimeStart,Duration,State%-6,Reason%-40,User where nodes=FX[12-14]
However I suspect you'd only get "not responding" again ;-)
Are you sure that all the slurm services are running correctly on those servers? Maybe try rebooting them?
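For a quick check of that (just a sketch, assuming the nodes run slurmd under systemd as the Debian packages do), something like:

ssh FX12 'systemctl status slurmd'                  # is the daemon active, and since when?
ssh FX12 'journalctl -u slurmd --since today'       # what has it logged recently?
scontrol show node FX12 | grep -Ei 'state|reason'   # what the controller currently thinks of the node

should show whether slurmd is actually up on each affected node and what the controller last heard from it.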
On 9/23/25 16:44, Davide DelVento wrote:
As the great Ole just taught us in another thread, this should tell you why:
sacctmgr show event Format=NodeName,TimeStart,Duration,State%-6,Reason%-40,User where nodes=FX[12-14]
However I suspect you'd only get "not responding" again ;-)
Good prediction!
sacctmgr show event Format=NodeName,TimeStart,Duration,State%-6,Reason%-40,User
       NodeName           TimeStart      Duration State  Reason                                         User
--------------- ------------------- ------------- ------ ---------------------------------------- ----------
                2021-08-25T11:13:56 1490-12:21:12        Cluster Registered TRES
FX12            2025-09-08T15:04:39   15-08:30:29 DOWN*  Not responding                           slurm(640+
FX13            2025-09-08T15:04:39   15-08:30:29 DOWN*  Not responding                           slurm(640+
FX14            2025-09-08T15:04:39   15-08:30:29 DOWN*  Not responding                           slurm(640+
Are you sure that all the slurm services are running correctly on those servers? Maybe try rebooting them?
The services were all running. "Correctly" is harder to say :-) I did not see anything obviously interesting in the logs, but I am not sure what to look for.
Anyway, I've followed your advice and rebooted the servers and they are idle for now. I will see how long it lasts. If that fixed it, I will fall on my sword and apologize for disturbing the ML...
Best,
Julien
Look at the slurmd logs on these nodes, or try running slurmd in the foreground (non-background mode).
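For example (a rough sketch; the log location depends on the SlurmdLogFile setting in your slurm.conf, and slurmd logs to syslog/journald if no file is configured):

scontrol show config | grep -i SlurmdLogFile        # where slurmd writes its log, if a file is set
ssh FX12 'journalctl -u slurmd -n 100 --no-pager'   # last messages via systemd

# to watch it live, stop the service and run slurmd in the foreground (as root) with extra verbosity
ssh FX12 'systemctl stop slurmd'
ssh -t FX12 'slurmd -D -vvv'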
And, as I said in another thread, check the time on these nodes.
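Something along these lines would show any drift at a glance (Slurm relies on munge authentication, which rejects credentials when clocks are too far apart, so a node whose clock has wandered can look as if it is not responding):

date +%s   # on the controller
for n in FX12 FX13 FX14; do echo "== $n"; ssh "$n" 'date +%s; timedatectl | grep -i synchronized'; done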
The first place to look, IMO, would be confirming connectivity on the Slurm-related ports (e.g. a firewall issue). In my experience that is especially likely when things work for a little while and then stop after some period of time.
The logs may also tell you what's going on.
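A rough sketch of that check (6817 and 6818 are Slurm's defaults for SlurmctldPort and SlurmdPort, so confirm yours first; nc and nft are just examples of tools that may or may not be installed on your nodes):

scontrol show config | grep -iE 'slurmctldport|slurmdport'   # the ports actually in use

nc -zv FX12 6818                  # controller -> node: can slurmctld reach slurmd?
ssh FX12 'nc -zv kandinsky 6817'  # node -> controller: can slurmd register back?
ssh FX12 'nft list ruleset'       # or: iptables -L -n  (is anything being filtered on the node?)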