Hi all,
I am hoping someone can help with our problem. Every hour after restarting slurmctld, the controller becomes unresponsive to commands for about 1 second, reporting errors such as:
[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934767]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934760]] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC) failed: Unexpected missing socket error
[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934875]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934906]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[939016]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
It occurs consistently at around the hour mark and generally not at other times, unless we run a reconfigure or restart the controller. We don’t see any issues in slurmdbd.log, and the errors always have a RESPONSE message type. We have tried building a new server on different infrastructure, but the problem persisted. Yesterday we even tried updating Slurm to v24.05.1 in the hope that it might provide a fix. During our troubleshooting we have set:
* SchedulerParameters = max_rpc_cnt=400,sched_min_interval=50000,sched_max_job_start=300,batch_sched_delay=20,bf_resolution=600,bf_min_prio_reserve=2000,bf_min_age_reserve=600
* SlurmctldPort = 6808-6817
Although the stats in sdiag have improved, we still see the errors.
In our monitoring software we also see a drop in network and disk activity during this 1 second, always at approximately 1 hour after restarting the controller.
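In case it helps anyone check for the same pattern, here is a minimal Python sketch of how the error bursts could be bucketed by timestamp to confirm the roughly hourly spacing. It assumes the slurmctld log line format shown in the errors above. The function names and the 60-second burst threshold are hypothetical choices for illustration, not part of Slurm.

```python
import re
from collections import Counter
from datetime import datetime

# Matches slurm_send_node_msg errors in the format shown in the log excerpt.
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.\d+\] error: "
    r"slurm_send_node_msg: .*msg_type=(?P<msg_type>\w+)\) failed"
)


def summarize_send_errors(lines):
    """Count matching errors per second and per message type."""
    per_second = Counter()
    per_type = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # not a slurm_send_node_msg error line
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S")
        per_second[ts] += 1
        per_type[m.group("msg_type")] += 1
    return per_second, per_type


def burst_gaps(per_second, min_gap_s=60):
    """Return burst start times and the gaps (in seconds) between bursts.

    A new burst starts when at least min_gap_s seconds (an assumed
    threshold) have passed since the previous error.
    """
    starts = []
    prev = None
    for ts in sorted(per_second):
        if prev is None or (ts - prev).total_seconds() >= min_gap_s:
            starts.append(ts)
        prev = ts
    gaps = [(b - a).total_seconds() for a, b in zip(starts, starts[1:])]
    return starts, gaps


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py <path to slurmctld log>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        per_second, per_type = summarize_send_errors(fh)
    starts, gaps = burst_gaps(per_second)
    print("errors by message type:", dict(per_type))
    print("burst starts:", [s.isoformat() for s in starts])
    print("gaps between bursts (s):", gaps)
```

If the gaps come out close to 3600 seconds, that would support the hourly pattern described above.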
Many Thanks in advance
Jason
Jason Ellul
Head - Research Computing Facility
Office of Cancer Research
Peter MacCallum Cancer Centre