[slurm-users] Re: slurmctld hourly: Unexpected missing socket error

29 Jul 2024

Thanks again, Patryk, for your insights.
We have implemented many of the same things, but the socket errors are still occurring regularly.
If we find a solution that works, I will be sure to add it to this thread.
Many thanks
Jason
Jason Ellul
Head - Research Computing Facility
Office of Cancer Research
My onsite days are Mon, alt Wed and Friday.
Phone +61 3 8559 6546
Email Jason.Ellul@petermac.org
305 Grattan Street
Melbourne, Victoria
3000 Australia
www.petermac.org
https://twitter.com/petermaccc
From: Patryk Bełzak via slurm-users slurm-users@lists.schedmd.com
Date: Wednesday, 24 July 2024 at 8:03 PM
To: Jason Ellul via slurm-users slurm-users@lists.schedmd.com
Subject: [slurm-users] Re: slurmctld hourly: Unexpected missing socket error
Hi,
We're on 389 Directory Server (aka 389ds), and it is a pretty large instance. One of the optimizations was to create proper ACIs on the server side, which significantly improved lookup times on the Slurm controller and worker nodes. The second was to move the SSSD cache to tmpfs, following the instructions from Red Hat: https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/tun...
The entire chapter 9 may be helpful.
I also remembered that I recently changed the kernel's local port range to match the slurmd port range from slurm.conf (60000-63001) by creating the file /etc/sysctl.d/91-slurm.conf with the following content:
# set ipv4 port range accordingly to slurmdPortRange in slurm.conf
net.ipv4.ip_local_port_range = 32768    63001
Unfortunately, it hasn't stopped the error from occurring.
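A quick way to confirm that the kernel's local port range really covers a given range is to compare the two directly. The sketch below is Python and is only an illustration: the helper names are made up, the path is the standard Linux procfs location for this sysctl, and 60000-63001 is simply the range quoted above.

```python
def parse_port_range(text):
    """Parse the 'low high' pair used by net.ipv4.ip_local_port_range."""
    low, high = (int(part) for part in text.split())
    return low, high


def covers(local_range, wanted_range):
    """Return True if wanted_range lies entirely inside local_range."""
    return local_range[0] <= wanted_range[0] and wanted_range[1] <= local_range[1]


def read_local_port_range(path="/proc/sys/net/ipv4/ip_local_port_range"):
    """Read the range currently in effect on a Linux host."""
    with open(path) as f:
        return parse_port_range(f.read())


# Values from the sysctl file above; the wanted range is the one quoted
# from slurm.conf in this message.
configured = parse_port_range("32768    63001")
print(covers(configured, (60000, 63001)))
```

On a live node, `covers(read_local_port_range(), (60000, 63001))` performs the same check against the running kernel.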
Best regards,
Patryk.
On 24/07/23 12:08, Jason Ellul via slurm-users wrote:
Hi Patryk,
Thanks so much for your email.
There are a couple of things you list that we have not tried yet, so we will definitely look at them. You mention optimizing SSSD, which has me curious: are you using Red Hat Identity Management (FreeIPA)? We are, and after going through our logs it appears the errors became more consistent after upgrading our instance and replica to RHEL 9.
May I please ask what optimizations you put in place for SSSD?
Many thanks
Jason
From: Patryk Bełzak via slurm-users slurm-users@lists.schedmd.com
Date: Monday, 22 July 2024 at 6:03 PM
To: Jason Ellul via slurm-users slurm-users@lists.schedmd.com
Subject: [slurm-users] Re: slurmctld hourly: Unexpected missing socket error
Hi,
We've been facing the same issue for some time. At the beginning the missing socket error happened every 20 minutes, later once per hour, and now it happens a few times a day.
The only downside was that the controller was unresponsive for those few seconds each time (up to 60, if I remember correctly).
We tried to debug it in many ways, but we've found no straightforward solution or source of the problem.
Things we've changed since the problem came up:

- Enabled per-user RPC rate limiting: `SlurmctldParameters=rl_enable,rl_bucket_size=50,rl_refill_period=1,rl_refill_rate=2,rl_table_size=16384`
- Made sure that the VM that Slurm runs on has the "network-latency" profile in `tuned`, with the same profile on the worker nodes
- Implemented some of the recommendations from https://slurm.schedmd.com/high_throughput.html on the controllers
- Largely optimized slurmdbd with some housekeeping and by cleaning up inactive accounts, associations, etc.
- Optimized the SSSD configuration (this one I believe had the biggest impact), both on the controllers and on the worker nodes

plus plenty of other (unrelated, I guess) changes.
I'm not really sure whether any of the above helped us significantly in that matter.
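For readers unfamiliar with the `rl_*` options above: Slurm documents them as a per-user token bucket, where `rl_bucket_size` is the burst allowance and `rl_refill_rate` tokens are added every `rl_refill_period` seconds. The Python sketch below is a simplified model of that idea using the values from this message. It is not slurmctld's implementation, and the class and method names are made up for illustration.

```python
class TokenBucket:
    """Simplified model of one user's RPC bucket (illustrative only)."""

    def __init__(self, bucket_size=50, refill_rate=2, refill_period=1):
        self.bucket_size = bucket_size      # rl_bucket_size
        self.refill_rate = refill_rate      # rl_refill_rate
        self.refill_period = refill_period  # rl_refill_period, in seconds
        self.tokens = bucket_size           # assume the bucket starts full
        self.last_refill = 0

    def allow(self, now):
        """Return True if an RPC arriving at time `now` (seconds) is accepted."""
        periods = int((now - self.last_refill) // self.refill_period)
        if periods > 0:
            # Add tokens for every whole period elapsed, capped at bucket size.
            self.tokens = min(self.bucket_size,
                              self.tokens + periods * self.refill_rate)
            self.last_refill += periods * self.refill_period
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket()
# A burst of 60 RPCs in the same second: only the bucket size gets through.
accepted = sum(bucket.allow(0) for _ in range(60))
print(accepted)
```

With these settings a single user can burst up to 50 RPCs and is then held to about 2 RPCs per second until the bucket refills.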
Best regards,
Patryk Belzak.
On 24/07/16 03:45, Jason Ellul via slurm-users wrote:
Hi all,
I am hoping someone can help with our problem. Every hour after restarting slurmctld the controller becomes unresponsive to commands for 1 sec, reporting errors such as:
[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934767]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934760]] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC) failed: Unexpected missing socket error
[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934875]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934906]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[939016]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
It occurs consistently at around the hour mark, but generally not at other times, unless we run a reconfigure or restart the controller. We don’t see any issues in the slurmdbd.log and the errors are also always msg type RESPONSE. We have tried building a new server on different infrastructure, but the problem has persisted. Yesterday we even tried updating slurm to v24.05.1 in the hope that may provide a fix. During our troubleshooting we have:
Set:

SchedulerParameters     = max_rpc_cnt=400,sched_min_interval=50000,sched_max_job_start=300,batch_sched_delay=20,bf_resolution=600,bf_min_prio_reserve=2000,bf_min_age_reserve=600

SlurmctldPort           = 6808-6817
But although the stats in sdiag have improved we still see the errors.
On our monitoring software we also see a drop in network and disk activity during this 1 second, always at approx. 1 hour after restarting the controller.
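To check the timing pattern from the log itself rather than from monitoring, the error lines can be grouped by timestamp and message type. The following Python sketch assumes the log line format shown in the excerpt above; the function names are hypothetical and the regular expression is fitted only to those sample lines.

```python
import re
from collections import Counter
from datetime import datetime

# Fitted to the slurmctld.log lines quoted above.
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.\d+\] error: "
    r"slurm_send_node_msg: .*msg_type=(?P<msg>\w+)\) failed: "
    r"Unexpected missing socket error"
)


def summarize(lines):
    """Count missing-socket errors per second and per message type."""
    per_second = Counter()
    per_type = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S")
            per_second[ts] += 1
            per_type[m.group("msg")] += 1
    return per_second, per_type


def burst_gaps(per_second):
    """Return the gaps, in seconds, between consecutive error bursts."""
    times = sorted(per_second)
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]
```

Feeding it the log file, for example `summarize(open(path))` with `path` pointing at the controller log, and then calling `burst_gaps` on the first result shows whether the bursts are spaced roughly 3600 seconds apart.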
Many Thanks in advance
Jason
Jason Ellul
Head - Research Computing Facility
Office of Cancer Research
Peter MacCallum Cancer Centre
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-leave@lists.schedmd.com
