[slurm-users] slurm 17.11.2: Socket timed out on send/recv operation

John DeSantis desantis at usf.edu
Tue Jan 16 10:54:55 MST 2018



Ciao Alessandro,

> setting MessageTimeout to 20 didn't solve it :(
> 
> looking at slurmctld logs I noticed many warnings like these
> 
> Jan 16 05:11:00 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from _slurm_rpc_dump_partitions: usec=42850604 began=05:10:17.289
> Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from load_part_uid_allow_list: usec=44861325 began=05:20:13.257
> Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from _slurmctld_background: usec=44861653 began=05:20:13.257

And:

> 271 Note very large processing time from _slurm_rpc_dump_partitions:
> 67 Note very large processing time from load_part_uid_allow_list:

I believe these values are in microseconds, so roughly 44 seconds per
call on average, mostly related to partition information.  Given that
our own configuration sets MessageTimeout to 90 seconds, I'd again
recommend another adjustment, perhaps to 60 seconds.
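
If you want to quantify that from the logs, something like the rough
one-liner below (run on the controller; adjust the --since window to
taste) should print the call count and average time in seconds for the
partition-dump RPC:

  journalctl -u slurmctld --since='2018-01-16 00:00:00' \
      | grep -oP '_slurm_rpc_dump_partitions: usec=\K[0-9]+' \
      | awk '{ s += $1; n++ } END { if (n) printf "%d calls, avg %.1f s\n", n, s / n / 1e6 }'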

I'm not sure if redefining your partitions will help, but you do have
several partitions which contain the same set of nodes and could be
condensed, decreasing the number of partitions.  For example, the
partitions bdw_all_serial & bdw_all_rcm could be consolidated into a
single partition by:

1.)  Using AllowQOS=bdw_all_serial,bdw_all_rcm;
2.)  Setting MaxTime to 04:00:00 and defining a MaxWall via each QOS
(since one partition has 04:00:00 and the other 03:00:00).

The same could be done for the partitions skl_fua_{prod,bprod,lprod} as
well.
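
Roughly, a consolidated bdw partition could look like the sketch below
(just a sketch - the node list and any other partition options are
placeholders to be filled in from your current config, and it assumes
QOSes named after the old partitions already exist or are created
first with "sacctmgr add qos"):

  # slurm.conf: one partition replacing bdw_all_serial and bdw_all_rcm
  PartitionName=bdw_all Nodes=<your bdw nodes> MaxTime=04:00:00 AllowQOS=bdw_all_serial,bdw_all_rcm State=UP

  # per-QOS walltime limits, matching the old partition limits
  sacctmgr modify qos bdw_all_serial set MaxWall=04:00:00
  sacctmgr modify qos bdw_all_rcm set MaxWall=03:00:00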

HTH,
John DeSantis


On Tue, 16 Jan 2018 11:22:44 +0100
Alessandro Federico <a.federico at cineca.it> wrote:

> Hi,
> 
> setting MessageTimeout to 20 didn't solve it :(
> 
> looking at slurmctld logs I noticed many warnings like these
> 
> Jan 16 05:11:00 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from _slurm_rpc_dump_partitions: usec=42850604 began=05:10:17.289
> Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from load_part_uid_allow_list: usec=44861325 began=05:20:13.257
> Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from _slurmctld_background: usec=44861653 began=05:20:13.257
> 
> they are generated in many functions:
> 
> [root at r000u17l01 ~]# journalctl -u slurmctld --since='2018-01-16 00:00:00' | grep -oP 'Note very large processing time from \w+:' | sort | uniq -c
>       4 Note very large processing time from dump_all_job_state:
>      67 Note very large processing time from load_part_uid_allow_list:
>      67 Note very large processing time from _slurmctld_background:
>       7 Note very large processing time from _slurm_rpc_complete_batch_script:
>       4 Note very large processing time from _slurm_rpc_dump_jobs:
>       3 Note very large processing time from _slurm_rpc_dump_job_user:
>     271 Note very large processing time from _slurm_rpc_dump_partitions:
>       5 Note very large processing time from _slurm_rpc_epilog_complete:
>       1 Note very large processing time from _slurm_rpc_job_pack_alloc_info:
>       3 Note very large processing time from _slurm_rpc_step_complete:
> 
> processing times are always around tens of seconds.
> 
> I'm attaching sdiag output and slurm.conf.
> 
> thanks
> ale
> 
> ----- Original Message -----
> > From: "Trevor Cooper" <tcooper at sdsc.edu>
> > To: "Slurm User Community List" <slurm-users at lists.schedmd.com>
> > Sent: Tuesday, January 16, 2018 12:10:21 AM
> > Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on
> > send/recv operation
> > 
> > Alessandro,
> > 
> > You might want to consider tracking your Slurm scheduler diagnostics
> > output with some type of time-series monitoring system. The
> > time-based history has proven more helpful at times than log
> > contents by themselves.
> > 
> > See Giovanni Torres' post on setting this up...
> > 
> >     http://giovannitorres.me/graphing-sdiag-with-graphite.html
> > 
> > -- Trevor
> > 
> > > On Jan 15, 2018, at 4:33 AM, Alessandro Federico
> > > <a.federico at cineca.it> wrote:
> > > 
> > > Hi John
> > > 
> > > thanks for the info.
> > > slurmctld doesn't report anything about the server thread count in
> > > the logs
> > > and sdiag show only 3 server threads.
> > > 
> > > We changed the MessageTimeout value to 20.
> > > 
> > > I'll let you know if it solves the problem.
> > > 
> > > Thanks
> > > ale
> > > 
> > > ----- Original Message -----
> > >> From: "John DeSantis" <desantis at usf.edu>
> > >> To: "Alessandro Federico" <a.federico at cineca.it>
> > >> Cc: slurm-users at lists.schedmd.com, "Isabella Baccarelli"
> > >> <i.baccarelli at cineca.it>, hpc-sysmgt-info at cineca.it
> > >> Sent: Friday, January 12, 2018 7:58:38 PM
> > >> Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on
> > >> send/recv operation
> > >> 
> > >> Ciao Alessandro,
> > >> 
> > >>> Do we have to apply any particular setting to avoid incurring
> > >>> the problem?
> > >> 
> > >> What is your "MessageTimeout" value in slurm.conf?  If it's at
> > >> the default of 10, try changing it to 20.
> > >> 
> > >> I'd also check and see if the slurmctld log is reporting anything
> > >> pertaining to the server thread count being over its limit.
> > >> 
> > >> HTH,
> > >> John DeSantis
> > >> 
> > >> On Fri, 12 Jan 2018 11:32:57 +0100
> > >> Alessandro Federico <a.federico at cineca.it> wrote:
> > >> 
> > >>> Hi all,
> > >>> 
> > >>> 
> > >>> we are setting up SLURM 17.11.2 on a small test cluster of about
> > >>> 100
> > >>> nodes. Sometimes we get the error in the subject when running
> > >>> any SLURM command (e.g. sinfo, squeue, scontrol reconf, etc...)
> > >>> 
> > >>> 
> > >>> Do we have to apply any particular setting to avoid incurring
> > >>> the problem?
> > >>> 
> > >>> 
> > >>> We found this bug report
> > >>> https://bugs.schedmd.com/show_bug.cgi?id=4002 but it regards the
> > >>> previous SLURM version and we do not set debug3 on slurmctld.
> > >>> 
> > >>> 
> > >>> thanks in advance
> > >>> ale
> > >>> 
> > >> 
> > >> 
> > > 
> > > --
> > > Alessandro Federico
> > > HPC System Management Group
> > > System & Technology Department
> > > CINECA www.cineca.it
> > > Via dei Tizii 6, 00185 Rome - Italy
> > > phone: +39 06 44486708
> > > 
> > > All work and no play makes Jack a dull boy.
> > > All work and no play makes Jack a dull boy.
> > > All work and no play makes Jack...
> > > 


