[slurm-users] slurm 17.11.2: Socket timed out on send/recv operation

Matthieu Hautreux matthieu.hautreux at gmail.com
Tue Jan 16 11:13:06 MST 2018


Hi,

For this kind of issue, one good thing to do is to get a backtrace of
slurmctld during the slowdown. That should quickly identify the
subcomponent responsible for the problem.
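
If gdb is available on the controller, a minimal way to grab one while
the commands hang is something like:

    gdb -p $(pidof slurmctld) -batch -ex 'thread apply all bt' > /tmp/slurmctld-bt.txt

Repeating this a few times during a slowdown usually shows which calls
the threads are blocked in.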

I would bet on something like LDAP requests taking too much time because of
a missing sssd cache.
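
A quick way to test that theory on the controller (the user and group
names below are only placeholders) is to time a few name-service
lookups:

    time id someuser
    time getent group somegroup

If those take seconds rather than milliseconds, slow LDAP/sssd lookups
are a likely cause; load_part_uid_allow_list, for instance, has to
resolve the groups listed in your partitions' AllowGroups.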

Regards
Matthieu

On 16 Jan 2018 at 18:59, "John DeSantis" <desantis at usf.edu> wrote:

>
> Ciao Alessandro,
>
> > setting MessageTimeout to 20 didn't solve it :(
> >
> > looking at slurmctld logs I noticed many warning like these
> >
> > Jan 16 05:11:00 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from _slurm_rpc_dump_partitions: usec=42850604 began=05:10:17.289
> > Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from load_part_uid_allow_list: usec=44861325 began=05:20:13.257
> > Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from _slurmctld_background: usec=44861653 began=05:20:13.257
>
> And:
>
> > 271 Note very large processing time from _slurm_rpc_dump_partitions:
> > 67 Note very large processing time from load_part_uid_allow_list:
>
> I believe these values are in microseconds, so that is an average of
> about 44 seconds per call, mostly related to partition information.
> Given that our own configuration has the value set to 90 seconds, I'd
> again recommend another adjustment, perhaps to 60 seconds.
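>
> Something along these lines in slurm.conf, followed by "scontrol
> reconfigure" (or a slurmctld restart), would do it:
>
>     MessageTimeout=60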
>
> I'm not sure whether redefining your partitions will help, but you do
> have several partitions which contain the same set of nodes and could
> be condensed, decreasing the number of partitions.  For example, the
> partitions bdw_all_serial & bdw_all_rcm could be consolidated into a
> single partition by:
>
> 1.)  Using AllowQOS=bdw_all_serial,bdw_all_rcm;
> 2.)  Setting MaxTime to 04:00:00 and defining a MaxWall via each QOS
> (since one partition has 04:00:00 and the other 03:00:00).
>
> The same could be done for the partitions skl_fua_{prod,bprod,lprod}
> as well; a rough sketch for the bdw case follows below.
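>
> A sketch of what that could look like (the node list is a placeholder,
> and swap the two MaxWall values if the limits were the other way
> around):
>
>     # slurm.conf: one partition instead of two
>     PartitionName=bdw_all Nodes=<bdw node list> MaxTime=04:00:00 AllowQOS=bdw_all_serial,bdw_all_rcm State=UP
>
>     # move the per-partition time limits onto the QOSes
>     sacctmgr modify qos bdw_all_serial set MaxWall=04:00:00
>     sacctmgr modify qos bdw_all_rcm set MaxWall=03:00:00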
>
> HTH,
> John DeSantis
>
>
> On Tue, 16 Jan 2018 11:22:44 +0100
> Alessandro Federico <a.federico at cineca.it> wrote:
>
> > Hi,
> >
> > setting MessageTimeout to 20 didn't solve it :(
> >
> > looking at slurmctld logs I noticed many warning like these
> >
> > Jan 16 05:11:00 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from _slurm_rpc_dump_partitions: usec=42850604 began=05:10:17.289
> > Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from load_part_uid_allow_list: usec=44861325 began=05:20:13.257
> > Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from _slurmctld_background: usec=44861653 began=05:20:13.257
> >
> > they are generated in many functions:
> >
> > [root at r000u17l01 ~]# journalctl -u slurmctld --since='2018-01-16 00:00:00' | grep -oP 'Note very large processing time from \w+:' | sort | uniq -c
> >       4 Note very large processing time from dump_all_job_state:
> >      67 Note very large processing time from load_part_uid_allow_list:
> >      67 Note very large processing time from _slurmctld_background:
> >       7 Note very large processing time from _slurm_rpc_complete_batch_script:
> >       4 Note very large processing time from _slurm_rpc_dump_jobs:
> >       3 Note very large processing time from _slurm_rpc_dump_job_user:
> >     271 Note very large processing time from _slurm_rpc_dump_partitions:
> >       5 Note very large processing time from _slurm_rpc_epilog_complete:
> >       1 Note very large processing time from _slurm_rpc_job_pack_alloc_info:
> >       3 Note very large processing time from _slurm_rpc_step_complete:
> >
> > processing times are always on the order of tens of seconds.
> >
> > I'm attaching sdiag output and slurm.conf.
> >
> > thanks
> > ale
> >
> > ----- Original Message -----
> > > From: "Trevor Cooper" <tcooper at sdsc.edu>
> > > To: "Slurm User Community List" <slurm-users at lists.schedmd.com>
> > > Sent: Tuesday, January 16, 2018 12:10:21 AM
> > > Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on
> > > send/recv operation
> > >
> > > Alessandro,
> > >
> > > You might want to consider tracking your Slurm scheduler diagnostics
> > > output with some type of time-series monitoring system. The
> > > time-based history has at times proven more helpful than the log
> > > contents by themselves.
> > >
> > > See Giovanni Torres' post on setting this up...
> > >
> > >     http://giovannitorres.me/graphing-sdiag-with-graphite.html
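> > >
> > > As a very rough illustration of the idea (the Graphite host and
> > > metric names are placeholders; the post above covers a proper
> > > setup), something like this could be run from cron every minute:
> > >
> > >     #!/bin/bash
> > >     # push a couple of sdiag counters to Graphite's plaintext listener (port 2003)
> > >     GRAPHITE_HOST=graphite.example.com
> > >     NOW=$(date +%s)
> > >     sdiag | awk -v ts="$NOW" '
> > >       /^Server thread count:/ { print "slurm.sdiag.server_threads " $4 " " ts }
> > >       /^Agent queue size:/    { print "slurm.sdiag.agent_queue "    $4 " " ts }
> > >     ' | nc -w 1 "$GRAPHITE_HOST" 2003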
> > >
> > > -- Trevor
> > >
> > > > On Jan 15, 2018, at 4:33 AM, Alessandro Federico
> > > > <a.federico at cineca.it> wrote:
> > > >
> > > > Hi John
> > > >
> > > > thanks for the info.
> > > > slurmctld doesn't report anything about the server thread count in
> > > > the logs, and sdiag shows only 3 server threads.
> > > >
> > > > We changed the MessageTimeout value to 20.
> > > >
> > > > I'll let you know if it solves the problem.
> > > >
> > > > Thanks
> > > > ale
> > > >
> > > > ----- Original Message -----
> > > >> From: "John DeSantis" <desantis at usf.edu>
> > > >> To: "Alessandro Federico" <a.federico at cineca.it>
> > > >> Cc: slurm-users at lists.schedmd.com, "Isabella Baccarelli"
> > > >> <i.baccarelli at cineca.it>, hpc-sysmgt-info at cineca.it
> > > >> Sent: Friday, January 12, 2018 7:58:38 PM
> > > >> Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on
> > > >> send/recv operation
> > > >>
> > > >> Ciao Alessandro,
> > > >>
> > > >>> Do we have to apply any particular setting to avoid incurring
> > > >>> the problem?
> > > >>
> > > >> What is your "MessageTimeout" value in slurm.conf?  If it's at
> > > >> the default of 10, try changing it to 20.
> > > >>
> > > >> I'd also check and see if the slurmctld log is reporting anything
> > > >> pertaining to the server thread count being over its limit.
> > > >>
> > > >> HTH,
> > > >> John DeSantis
> > > >>
> > > >> On Fri, 12 Jan 2018 11:32:57 +0100
> > > >> Alessandro Federico <a.federico at cineca.it> wrote:
> > > >>
> > > >>> Hi all,
> > > >>>
> > > >>>
> > > >>> we are setting up SLURM 17.11.2 on a small test cluster of about
> > > >>> 100
> > > >>> nodes. Sometimes we get the error in the subject when running
> > > >>> any SLURM command (e.g. sinfo, squeue, scontrol reconf, etc...)
> > > >>>
> > > >>>
> > > >>> Do we have to apply any particular setting to avoid incurring
> > > >>> the problem?
> > > >>>
> > > >>>
> > > >>> We found this bug report
> > > >>> https://bugs.schedmd.com/show_bug.cgi?id=4002 but it concerns the
> > > >>> previous SLURM version, and we do not set debug3 on slurmctld.
> > > >>>
> > > >>>
> > > >>> thanks in advance
> > > >>> ale
> > > >>>
> > > >>
> > > >>
> > > >
> > > > --
> > > > Alessandro Federico
> > > > HPC System Management Group
> > > > System & Technology Department
> > > > CINECA www.cineca.it
> > > > Via dei Tizii 6, 00185 Rome - Italy
> > > > phone: +39 06 44486708
> > > >
> > > > All work and no play makes Jack a dull boy.
> > > > All work and no play makes Jack a dull boy.
> > > > All work and no play makes Jack...
> > > >
> > >
> > >
> > >
> >
>
>

