Hi, we've been facing the same issue for some time. At first the missing socket error occurred every 20 minutes, later once per hour, and now it happens a few times a day. The only downside was that the controller was unresponsive for a few seconds each time (up to 60, if I remember correctly). We tried to debug it in many ways, but found no straightforward solution or root cause.
Things we've changed since the problem came up:
* RPC user limit: `SlurmctldParameters=rl_enable,rl_bucket_size=50,rl_refill_period=1,rl_refill_rate=2,rl_table_size=16384`
* made sure that the VM that Slurm runs on has the "network-latency" profile in `tuned`, with the same profile on the worker nodes
* implemented some of the recommendations from https://slurm.schedmd.com/high_throughput.html on the controllers
* largely optimized slurmdb with some housekeeping: cleaning up inactive accounts, associations, etc.
* optimized the SSSD configuration (this one I believe had the biggest impact), both on the controllers and on the worker nodes

plus plenty of other (probably unrelated) changes.
I'm not really sure whether any of the above helped us significantly in that matter.
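For anyone unsure what the `rl_*` values in our `SlurmctldParameters` line amount to, here is a minimal sketch of a generic per-user token bucket using the same numbers. This is only my reading of the parameters (bucket of 50 tokens, 2 tokens added every 1 second), not Slurm's actual implementation; the class `TokenBucket` and its method `allow` are hypothetical names made up for this illustration.

```python
class TokenBucket:
    """Generic token bucket; an illustration, not slurmctld's code."""

    def __init__(self, bucket_size=50, refill_period=1, refill_rate=2):
        self.bucket_size = bucket_size      # assumed meaning of rl_bucket_size
        self.refill_period = refill_period  # assumed meaning of rl_refill_period (seconds)
        self.refill_rate = refill_rate      # assumed meaning of rl_refill_rate (tokens per period)
        self.tokens = bucket_size           # a new user starts with a full bucket
        self.last_refill = 0                # time of the last refill, in seconds

    def allow(self, now):
        """Return True if an RPC arriving at time `now` (seconds) is accepted."""
        periods = (now - self.last_refill) // self.refill_period
        if periods > 0:
            # Add tokens for every full period elapsed, capped at the bucket size.
            self.tokens = min(self.bucket_size,
                              self.tokens + periods * self.refill_rate)
            self.last_refill += periods * self.refill_period
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False
```

Under these assumptions a single user can burst up to 50 RPCs at once, but is then held to roughly 2 RPCs per second until the bucket refills.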
Best regards, Patryk Belzak.
On 24/07/16 03:45, Jason Ellul via slurm-users wrote:
Hi all,
I am hoping someone can help with our problem. Every hour after restarting slurmctld the controller becomes unresponsive to commands for 1 sec, reporting errors such as:
[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934767]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934760]] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC) failed: Unexpected missing socket error
[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934875]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934906]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[939016]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
It occurs consistently at around the hour mark, but generally not at other times, unless we run a reconfigure or restart the controller. We don’t see any issues in slurmdbd.log, and the errors always have a RESPONSE_* message type. We have tried building a new server on different infrastructure, but the problem has persisted. Yesterday we even tried updating Slurm to v24.05.1 in the hope that it might provide a fix.

During our troubleshooting we have set:
SchedulerParameters = max_rpc_cnt=400,sched_min_interval=50000,sched_max_job_start=300,batch_sched_delay=20,bf_resolution=600,bf_min_prio_reserve=2000,bf_min_age_reserve=600
SlurmctldPort = 6808-6817
Although the stats in sdiag have improved, we still see the errors.
In our monitoring software we also see a drop in network and disk activity during this 1 second, always at approximately 1 hour after restarting the controller.
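To check the hourly pattern against the controller log rather than by eye, something like the following rough sketch could be used. It assumes the log lines look exactly like the ones quoted above; the function names `summarize` and `burst_gaps`, and the 60-second threshold for treating errors as one burst, are hypothetical choices for this illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the "missing socket" error lines in the format quoted above.
LINE_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.\d+\] "
    r"error: slurm_send_node_msg: .*?msg_type=(?P<msg_type>\w+)\) "
    r"failed: Unexpected missing socket error"
)


def summarize(lines):
    """Count matching errors per second and per msg_type."""
    per_second = Counter()
    per_type = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a missing-socket error line
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S")
        per_second[ts] += 1
        per_type[match.group("msg_type")] += 1
    return per_second, per_type


def burst_gaps(per_second, min_gap_seconds=60):
    """Return the gaps in seconds between the starts of consecutive error bursts.

    Errors closer together than min_gap_seconds count as the same burst.
    """
    starts = []
    previous = None
    for ts in sorted(per_second):
        if previous is None or (ts - previous).total_seconds() > min_gap_seconds:
            starts.append(ts)
        previous = ts
    return [(b - a).total_seconds() for a, b in zip(starts, starts[1:])]
```

Feeding it the lines of slurmctld.log (for example `summarize(open(path))`, with `path` being wherever the log lives on the controller) gives the error counts per message type, and gaps close to 3600 from `burst_gaps` would confirm the one-hour interval.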
Many Thanks in advance
Jason
Jason Ellul
Head - Research Computing Facility
Office of Cancer Research
Peter MacCallum Cancer Centre
-- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-leave@lists.schedmd.com