[slurm-users] Socket Timed Out on Send/Recv Operation

Janne Blomqvist janne.blomqvist at aalto.fi
Thu Apr 18 07:14:13 UTC 2019


On 17/04/2019 18.54, Yang Liu wrote:
> We often receive errors due to socket timeouts on send/recv operations:
> 
> slurm_load_jobs error: Socket timed out on send/recv operation
> slurm_load_node: Socket timed out on send/recv operation
> 
> 
> What could cause these errors? How likely is it that job_submit.lua
> causes them? We have a program that runs every 2 seconds to collect
> information about pending jobs. Could that program cause the errors?

Maybe the Slurm controller is overloaded, in which case any load you can
take off it helps.
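
As one way to reduce that load, here is a minimal Python sketch of a
pending-job poller that queries less often than every 2 seconds and backs
off when a query fails. The interval values, the function names and the
choice of exponential backoff are assumptions for illustration, not
something from the original setup; only the squeue options (--noheader,
--states=PENDING) are standard.

```python
import subprocess
import time

BASE_INTERVAL = 30.0   # seconds; assumed value, much gentler than 2 s
MAX_INTERVAL = 300.0   # assumed upper bound for the backoff


def next_interval(current, ok, base=BASE_INTERVAL, cap=MAX_INTERVAL):
    """Reset to the base interval on success, double (up to cap) on failure."""
    if ok:
        return base
    return min(current * 2, cap)


def pending_jobs():
    """Return squeue output lines for pending jobs, or None if the query failed."""
    try:
        result = subprocess.run(
            ["squeue", "--noheader", "--states=PENDING"],
            capture_output=True, text=True, timeout=60,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        # Covers errors such as "Socket timed out on send/recv operation".
        return None
    return result.stdout.splitlines()


def poll_forever():
    """Poll the controller, sleeping longer after each failed query."""
    interval = BASE_INTERVAL
    while True:
        jobs = pending_jobs()
        if jobs is not None:
            print(f"{len(jobs)} pending jobs")
        interval = next_interval(interval, jobs is not None)
        time.sleep(interval)
```

Calling poll_forever() starts the loop. The point of the backoff is that a
controller that is already timing out gets fewer requests, not more.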

However, even if the controller isn't generally overloaded, there can
still be occasional spikes causing these kinds of issues. We used to
suffer from these errors as well; in our case it was enough to raise the
net.core.somaxconn and net.ipv4.tcp_max_syn_backlog sysctls (we use 4096
for both). See also https://slurm.schedmd.com/high_throughput.html
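
To check whether a controller host already has these values, something
like the following Python sketch could be used. It reads the two settings
from their standard Linux locations under /proc/sys and compares them to
the 4096 mentioned above. The function names and the report format are
made up for this example.

```python
from pathlib import Path

# Target value taken from the message above (4096 for both settings).
TARGET = 4096

# Standard Linux procfs locations for the two sysctl keys.
SETTINGS = {
    "net.core.somaxconn": "net/core/somaxconn",
    "net.ipv4.tcp_max_syn_backlog": "net/ipv4/tcp_max_syn_backlog",
}


def check_backlog_settings(proc_sys="/proc/sys", target=TARGET):
    """Return {sysctl_key: (current_value, meets_target)} for each setting.

    current_value is None when the file cannot be read or parsed.
    """
    report = {}
    for key, rel_path in SETTINGS.items():
        try:
            value = int((Path(proc_sys) / rel_path).read_text().split()[0])
        except (OSError, ValueError, IndexError):
            report[key] = (None, False)
            continue
        report[key] = (value, value >= target)
    return report


def sysctl_conf_lines(target=TARGET):
    """Return lines suitable for a sysctl.d drop-in file."""
    return [f"{key} = {target}" for key in SETTINGS]
```

Printing check_backlog_settings() on the controller host shows the current
values. The lines from sysctl_conf_lines() can be placed in a file under
/etc/sysctl.d/ and loaded with "sysctl --system" to make the change
persistent.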


-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqvist at aalto.fi


