[slurm-users] slurm 17.11.2: Socket timed out on send/recv operation

Cooper, Trevor tcooper at sdsc.edu
Mon Jan 15 16:10:21 MST 2018


Alessandro,

You might want to track your Slurm scheduler diagnostics (sdiag) output in a time-series monitoring system. The history over time has sometimes proven more helpful than the log contents by themselves.

See Giovanni Torres' post on setting this up...

    http://giovannitorres.me/graphing-sdiag-with-graphite.html
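
As a rough illustration of the idea (this is not the script from that post), here is a minimal Python sketch that turns "Name: number" lines from sdiag output into Graphite plaintext-protocol lines ("path value timestamp"). The sample text, the metric prefix "slurm.sdiag" and the helper names are assumptions made up for the example; check them against what your own sdiag prints.

```python
import re
import socket

# Matches simple "Name: number" lines; headers and other text are skipped.
_LINE = re.compile(r"^\s*([A-Za-z][A-Za-z0-9 _()/-]*?):\s*(-?\d+(?:\.\d+)?)\s*$")


def parse_sdiag(text):
    """Return {metric_name: number} for every "Name: number" line in text.

    Names are lower-cased and non-alphanumerics become underscores.
    Caveat: if the same name appears in several sections of the output,
    the last occurrence wins; a fuller version would track sections.
    """
    metrics = {}
    for line in text.splitlines():
        m = _LINE.match(line)
        if not m:
            continue
        name = re.sub(r"[^a-z0-9]+", "_", m.group(1).lower()).strip("_")
        raw = m.group(2)
        metrics[name] = float(raw) if "." in raw else int(raw)
    return metrics


def to_graphite_lines(metrics, prefix, timestamp):
    """Format metrics as Graphite plaintext lines: "<path> <value> <epoch>"."""
    return [
        f"{prefix}.{name} {value} {int(timestamp)}"
        for name, value in sorted(metrics.items())
    ]


def send_lines(lines, host, port=2003):
    """Send lines to a Carbon plaintext listener (2003 is Carbon's default)."""
    payload = "".join(line + "\n" for line in lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


# Hypothetical sample, not captured from a real cluster.
SAMPLE = """\
Server thread count: 3
Agent queue size:    0
Jobs submitted: 120
"""

lines = to_graphite_lines(parse_sdiag(SAMPLE), "slurm.sdiag", 1516000000)
```

Run from cron every minute or so, with the text taken from the real sdiag command and the result passed to send_lines(), this would give you a history of values such as the server thread count to compare against the times the timeouts occur.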

-- Trevor

> On Jan 15, 2018, at 4:33 AM, Alessandro Federico <a.federico at cineca.it> wrote:
> 
> Hi John
> 
> thanks for the info. 
> slurmctld doesn't report anything about the server thread count in the logs
> and sdiag shows only 3 server threads.
> 
> We changed the MessageTimeout value to 20.
> 
> I'll let you know if it solves the problem.
> 
> Thanks
> ale
> 
> ----- Original Message -----
>> From: "John DeSantis" <desantis at usf.edu>
>> To: "Alessandro Federico" <a.federico at cineca.it>
>> Cc: slurm-users at lists.schedmd.com, "Isabella Baccarelli" <i.baccarelli at cineca.it>, hpc-sysmgt-info at cineca.it
>> Sent: Friday, January 12, 2018 7:58:38 PM
>> Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv operation
>> 
>> Ciao Alessandro,
>> 
>>> Do we have to apply any particular setting to avoid incurring the
>>> problem?
>> 
>> What is your "MessageTimeout" value in slurm.conf?  If it's at the
>> default of 10, try changing it to 20.
>> 
>> I'd also check and see if the slurmctld log is reporting anything
>> pertaining to the server thread count being over its limit.
>> 
>> HTH,
>> John DeSantis
>> 
>> On Fri, 12 Jan 2018 11:32:57 +0100
>> Alessandro Federico <a.federico at cineca.it> wrote:
>> 
>>> Hi all,
>>> 
>>> 
>>> we are setting up SLURM 17.11.2 on a small test cluster of about 100
>>> nodes. Sometimes we get the error in the subject when running any
>>> SLURM command (e.g. sinfo, squeue, scontrol reconf)
>>> 
>>> 
>>> Do we have to apply any particular setting to avoid incurring the
>>> problem?
>>> 
>>> 
>>> We found this bug report
>>> https://bugs.schedmd.com/show_bug.cgi?id=4002 but it concerns the
>>> previous SLURM version and we do not set debug3 on slurmctld.
>>> 
>>> 
>>> thanks in advance
>>> ale
>>> 
>> 
>> 
> 
> -- 
> Alessandro Federico 
> HPC System Management Group 
> System & Technology Department 
> CINECA www.cineca.it 
> Via dei Tizii 6, 00185 Rome - Italy 
> phone: +39 06 44486708 
> 
> All work and no play makes Jack a dull boy. 
> All work and no play makes Jack a dull boy. 
> All work and no play makes Jack...
> 



