[slurm-users] slurm 17.11.2: Socket timed out on send/recv operation

Alessandro Federico a.federico at cineca.it
Tue Jan 16 03:22:44 MST 2018


Hi,

setting MessageTimeout to 20 didn't solve it :(
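
For reference, the change we applied was just this one line in
slurm.conf (picked up with an scontrol reconfigure):

# slurm.conf - raised from the default of 10 seconds
MessageTimeout=20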

Looking at the slurmctld logs, I noticed many warnings like these:

Jan 16 05:11:00 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from _slurm_rpc_dump_partitions: usec=42850604 began=05:10:17.289
Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from load_part_uid_allow_list: usec=44861325 began=05:20:13.257
Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from _slurmctld_background: usec=44861653 began=05:20:13.257

They are generated by many different functions:

[root@r000u17l01 ~]# journalctl -u slurmctld --since='2018-01-16 00:00:00' | grep -oP 'Note very large processing time from \w+:' | sort | uniq -c
      4 Note very large processing time from dump_all_job_state:
     67 Note very large processing time from load_part_uid_allow_list:
     67 Note very large processing time from _slurmctld_background:
      7 Note very large processing time from _slurm_rpc_complete_batch_script:
      4 Note very large processing time from _slurm_rpc_dump_jobs:
      3 Note very large processing time from _slurm_rpc_dump_job_user:
    271 Note very large processing time from _slurm_rpc_dump_partitions:
      5 Note very large processing time from _slurm_rpc_epilog_complete:
      1 Note very large processing time from _slurm_rpc_job_pack_alloc_info:
      3 Note very large processing time from _slurm_rpc_step_complete:

Processing times are consistently in the tens of seconds.
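
In case it's useful, this quick check pulls the five worst processing
times (usec values only) since midnight:

journalctl -u slurmctld --since='2018-01-16 00:00:00' \
    | grep -oP 'usec=\K[0-9]+' | sort -rn | head -5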

I'm attaching the sdiag output and our slurm.conf.
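
We will also look into graphing sdiag over time, as Trevor suggests
below. A minimal sketch of what we have in mind, run from cron every
minute (the Graphite host "graphite" and port 2003 are placeholders,
not our actual setup):

#!/bin/bash
# Sketch only: sample two sdiag counters and push them to Graphite's
# plaintext protocol. Host "graphite" and port 2003 are placeholders.
now=$(date +%s)
sdiag | awk -v now="$now" '
    /^Server thread count:/ {print "slurm.sdiag.server_threads " $4 " " now}
    /^Agent queue size:/    {print "slurm.sdiag.agent_queue "    $4 " " now}
' > /dev/tcp/graphite/2003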

thanks
ale

----- Original Message -----
> From: "Trevor Cooper" <tcooper at sdsc.edu>
> To: "Slurm User Community List" <slurm-users at lists.schedmd.com>
> Sent: Tuesday, January 16, 2018 12:10:21 AM
> Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv operation
> 
> Alessandro,
> 
> You might want to consider tracking your Slurm scheduler diagnostics
> output with some type of time-series monitoring system. The
> time-based history has proven more helpful at times than log
> contents by themselves.
> 
> See Giovanni Torres' post on setting this up...
> 
>     http://giovannitorres.me/graphing-sdiag-with-graphite.html
> 
> -- Trevor
> 
> > On Jan 15, 2018, at 4:33 AM, Alessandro Federico
> > <a.federico at cineca.it> wrote:
> > 
> > Hi John
> > 
> > thanks for the info.
> > slurmctld doesn't report anything about the server thread count in
> > the logs, and sdiag shows only 3 server threads.
> > 
> > We changed the MessageTimeout value to 20.
> > 
> > I'll let you know if it solves the problem.
> > 
> > Thanks
> > ale
> > 
> > ----- Original Message -----
> >> From: "John DeSantis" <desantis at usf.edu>
> >> To: "Alessandro Federico" <a.federico at cineca.it>
> >> Cc: slurm-users at lists.schedmd.com, "Isabella Baccarelli"
> >> <i.baccarelli at cineca.it>, hpc-sysmgt-info at cineca.it
> >> Sent: Friday, January 12, 2018 7:58:38 PM
> >> Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on
> >> send/recv operation
> >> 
> >> Ciao Alessandro,
> >> 
> >>> Do we have to apply any particular setting to avoid incurring the
> >>> problem?
> >> 
> >> What is your "MessageTimeout" value in slurm.conf?  If it's at the
> >> default of 10, try changing it to 20.
> >> 
> >> I'd also check and see if the slurmctld log is reporting anything
> >> pertaining to the server thread count being over its limit.
> >> 
> >> HTH,
> >> John DeSantis
> >> 
> >> On Fri, 12 Jan 2018 11:32:57 +0100
> >> Alessandro Federico <a.federico at cineca.it> wrote:
> >> 
> >>> Hi all,
> >>> 
> >>> 
> >>> we are setting up SLURM 17.11.2 on a small test cluster of about
> >>> 100
> >>> nodes. Sometimes we get the error in the subject when running any
> >>> SLURM command (e.g. sinfo, squeue, scontrol reconf)
> >>> 
> >>> 
> >>> Do we have to apply any particular setting to avoid incurring the
> >>> problem?
> >>> 
> >>> 
> >>> We found this bug report:
> >>> https://bugs.schedmd.com/show_bug.cgi?id=4002. However, it concerns
> >>> the previous SLURM version, and we do not run slurmctld at debug3
> >>> verbosity.
> >>> 
> >>> 
> >>> thanks in advance
> >>> ale
> >>> 
> >> 
> >> 
> > 
> > --
> > Alessandro Federico
> > HPC System Management Group
> > System & Technology Department
> > CINECA www.cineca.it
> > Via dei Tizii 6, 00185 Rome - Italy
> > phone: +39 06 44486708
> > 
> > All work and no play makes Jack a dull boy.
> > All work and no play makes Jack a dull boy.
> > All work and no play makes Jack...
> > 
> 
> 
> 

-- 
Alessandro Federico 
HPC System Management Group 
System & Technology Department 
CINECA www.cineca.it 
Via dei Tizii 6, 00185 Rome - Italy 
phone: +39 06 44486708 

All work and no play makes Jack a dull boy. 
All work and no play makes Jack a dull boy. 
All work and no play makes Jack...
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sdiag.out
Type: application/octet-stream
Size: 10653 bytes
Desc: not available
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20180116/3507ba89/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: slurm.conf
Type: application/octet-stream
Size: 9426 bytes
Desc: not available
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20180116/3507ba89/attachment-0003.obj>

