Hello,

We are running a SLURM-managed cluster with one control node (g-vm03) and 26 worker nodes (ouga[03-28]) on Rocky 8. We recently upgraded from 20.11.9 via 23.02.8 to 24.11.0 and then to 24.11.5. Since then, we have been experiencing performance issues: squeue and scontrol ping are slow to respond and sometimes fail with "timeout on send/recv" messages, even with only very few parallel requests. We did not see these issues with SLURM 20.11.9; we did not examine the intermediate version 23.02.8 in detail. In the slurmctld log, we also find messages like:

slurmctld: error: slurm_send_node_msg: [socket:[1272743]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error

We have since implemented all recommendations from the high throughput documentation, which did bring some improvement (most notably increasing the maximum number of open files and raising MessageTimeout and TCPTimeout), but the issues persist.

For debugging, I have attached the slurm.conf, the sdiag output (the server thread count is almost always 1 and only occasionally rises to 2), and the slurmctld and slurmdbd logs from a period of high load.
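
In case it helps with reproducing the slowdown, response times under a few parallel requests can be measured with a small script along the following lines. This is only a minimal sketch and not the exact procedure we used: the function names, the default of 4 parallel requests over 5 rounds, and the 120-second timeout are arbitrary illustrative choices. It runs "scontrol ping" by default, or whatever command is given on the command line.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def time_command(cmd, timeout=120.0):
    """Run cmd once and return (elapsed_seconds, exit_code).

    exit_code is None if the command did not finish within `timeout`
    (an arbitrary illustrative limit).
    """
    start = time.perf_counter()
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        code = proc.returncode
    except subprocess.TimeoutExpired:
        code = None
    return time.perf_counter() - start, code


def measure(cmd, parallel=4, rounds=5):
    """Run `parallel` copies of cmd at once, `rounds` times; collect all timings."""
    results = []
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        for _ in range(rounds):
            results.extend(pool.map(time_command, [cmd] * parallel))
    return results


if __name__ == "__main__":
    # Default to "scontrol ping"; any other command can be passed as arguments.
    cmd = sys.argv[1:] or ["scontrol", "ping"]
    results = measure(cmd)
    times = sorted(elapsed for elapsed, _ in results)
    # Count runs that timed out or exited with a non-zero status.
    not_ok = sum(1 for _, code in results if code != 0)
    print(f"runs: {len(times)}  min: {times[0]:.2f}s  "
          f"median: {times[len(times) // 2]:.2f}s  max: {times[-1]:.2f}s  "
          f"non-zero or timed out: {not_ok}")
```

Running it once with "scontrol ping" and once with "squeue" as arguments gives comparable numbers for both commands.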

We would be very thankful for any input on how to restore the old performance.

Kind Regards,
Tilman Hoffbauer