[slurm-users] slurm 17.11.2: Socket timed out on send/recv operation

James A. Peltier jpeltier at sfu.ca
Tue Jan 23 15:23:23 MST 2018


We put the SSSD caches on a RAMDISK, which helped a little with performance.
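
One way to do that, as a sketch (mount point and size are assumptions for a stock RHEL-style sssd; stop sssd while remounting, and note the cache is rebuilt from scratch after every reboot):

    # /etc/fstab
    tmpfs  /var/lib/sss/db  tmpfs  size=512M,mode=0700  0 0

Ownership and SELinux context may need adjusting depending on the distribution.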

----- On 22 Jan, 2018, at 02:38, Alessandro Federico a.federico at cineca.it wrote:

| Hi John,
| 
| just an update...
| we do not have a solution for the SSSD issue yet, but we changed the ACL
| on the 2 partitions from AllowGroups=g2 to AllowAccounts=g2 and the
| slowdown is gone.
| 
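| For the record, the change boils down to swapping one keyword per
| partition line in slurm.conf, e.g. (sketch, assuming a Slurm account
| named g2 exists in slurmdbd):
| 
|   # before: the controller expands the POSIX group through NSS/sssd
|   PartitionName=bdw_fua_gw ... AllowGroups=g2
|   # after: membership is checked against the Slurm account instead
|   PartitionName=bdw_fua_gw ... AllowAccounts=g2
| 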
| Thanks for the help
| ale
| 
| ----- Original Message -----
|> From: "Alessandro Federico" <a.federico at cineca.it>
|> To: "John DeSantis" <desantis at usf.edu>
|> Cc: hpc-sysmgt-info at cineca.it, "Slurm User Community List"
|> <slurm-users at lists.schedmd.com>, "Isabella Baccarelli"
|> <i.baccarelli at cineca.it>
|> Sent: Wednesday, January 17, 2018 5:41:54 PM
|> Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv
|> operation
|> 
|> Hi John
|> 
|> thanks for the information.
|> We are investigating the sssd slowdown and I found some bug reports
|> about slow sssd queries with almost the same backtrace. Hopefully an
|> sssd update will solve this issue.
|> 
|> We'll let you know if we find a solution.
|> 
|> thanks
|> ale
|> 
|> ----- Original Message -----
|> > From: "John DeSantis" <desantis at usf.edu>
|> > To: "Alessandro Federico" <a.federico at cineca.it>
|> > Cc: "Slurm User Community List" <slurm-users at lists.schedmd.com>,
|> > "Isabella Baccarelli" <i.baccarelli at cineca.it>,
|> > hpc-sysmgt-info at cineca.it
|> > Sent: Wednesday, January 17, 2018 3:30:43 PM
|> > Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on
|> > send/recv operation
|> > 
|> > Ale,
|> > 
|> > > As Matthieu said, it seems to be something related to the SSS daemon.
|> > 
|> > That was a great catch by Matthieu.
|> > 
|> > > Moreover, only 3 SLURM partitions have the AllowGroups ACL
|> > 
|> > Correct, which may seem negligible, but after each `scontrol
|> > reconfigure`, slurmctld restart, and/or AllowGroups= partition
|> > update, the mapping of UIDs for each group will be refreshed.
|> > 
|> > > So why does the UID-GID mapping take so long?
|> > 
|> > We attempted to use "AllowGroups" previously, but we found (even
|> > with
|> > sssd.conf tuning regarding enumeration) that unless the group was
|> > local
|> > (/etc/group), we were experiencing delays before the AllowGroups
|> > parameter was respected.  This is why we opted to use SLURM's
|> > AllowQOS/AllowAccounts instead.
|> > 
|> > You would have to enable debugging on your remote authentication
|> > software to see where the hang-up is occurring (if it is that at
|> > all,
|> > and not just a delay with the slurmctld).
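|> > 
|> > (For sssd that typically means raising debug_level and watching the
|> > logs; a minimal sketch, section names are just examples:
|> > 
|> >   # /etc/sssd/sssd.conf
|> >   [nss]
|> >   debug_level = 6
|> >   [domain/your_domain]   # placeholder domain name
|> >   debug_level = 6
|> > 
|> > then restart sssd and follow /var/log/sssd/*.log while reproducing
|> > the slow reconfigure.)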
|> > 
|> > Given the direction that this is going - why not replace the
|> > "AllowGroups" with either a simple "AllowAccounts=" or "AllowQOS="?
|> > 
|> > > @John: we defined many partitions on the same nodes but in the
|> > > production cluster they will be more or less split across the 6K
|> > > nodes.
|> > 
|> > Ok, that makes sense.  Looking initially at your partition
|> > definitions, I immediately thought of keeping things DRY,
|> > especially since the "finer" tuning between the partitions could
|> > easily be controlled via the QOSes allowed to access the resources.
|> > 
|> > John DeSantis
|> > 
|> > On Wed, 17 Jan 2018 13:20:49 +0100
|> > Alessandro Federico <a.federico at cineca.it> wrote:
|> > 
|> > > Hi Matthieu & John
|> > > 
|> > > this is the backtrace of slurmctld during the slowdown
|> > > 
|> > > (gdb) bt
|> > > #0  0x00007fb0e8b1e69d in poll () from /lib64/libc.so.6
|> > > #1  0x00007fb0e8617bfa in sss_cli_make_request_nochecks () from /lib64/libnss_sss.so.2
|> > > #2  0x00007fb0e86185a3 in sss_nss_make_request () from /lib64/libnss_sss.so.2
|> > > #3  0x00007fb0e8619104 in _nss_sss_getpwnam_r () from /lib64/libnss_sss.so.2
|> > > #4  0x00007fb0e8aef07d in getpwnam_r@@GLIBC_2.2.5 () from /lib64/libc.so.6
|> > > #5  0x00007fb0e9360986 in _getpwnam_r (result=<optimized out>, bufsiz=<optimized out>, buf=<optimized out>, pwd=<optimized out>, name=<optimized out>) at uid.c:73
|> > > #6  uid_from_string (name=0x1820e41 "g2bottin", uidp=uidp@entry=0x7fff07f03a6c) at uid.c:111
|> > > #7  0x000000000043587d in get_group_members (group_name=0x10ac500 "g2") at groups.c:139
|> > > #8  0x000000000047525a in _get_groups_members (group_names=<optimized out>) at partition_mgr.c:2006
|> > > #9  0x0000000000475505 in _update_part_uid_access_list (x=0x7fb03401e650, arg=0x7fff07f13bf4) at partition_mgr.c:1930
|> > > #10 0x00007fb0e92ab675 in list_for_each (l=0x1763e50, f=f@entry=0x4754d8 <_update_part_uid_access_list>, arg=arg@entry=0x7fff07f13bf4) at list.c:420
|> > > #11 0x000000000047911a in load_part_uid_allow_list (force=1) at partition_mgr.c:1971
|> > > #12 0x0000000000428e5c in _slurmctld_background (no_data=0x0) at controller.c:1911
|> > > #13 main (argc=<optimized out>, argv=<optimized out>) at controller.c:601
|> > > 
|> > > As Matthieu said, it seems to be something related to the SSS daemon.
|> > > However, we don't notice any slowdown due to SSSD elsewhere in our
|> > > environment. As I told you before, we are just testing SLURM on a
|> > > small 100-node cluster before going into production with about 6000
|> > > nodes next Wednesday. At present the other nodes are managed by
|> > > PBSPro, and the 2 PBS servers are running on the same nodes as the
|> > > SLURM controllers. The PBS queues are also configured with user/group
|> > > ACLs and we never noticed any similar slowdown there.
|> > > 
|> > > Moreover, only 3 SLURM partitions have the AllowGroups ACL
|> > > 
|> > > [root@mgmt01 slurm]# grep AllowGroups slurm.conf
|> > > PartitionName=bdw_fua_gwdbg Nodes=r040c03s0[1,2] Default=NO DefMemPerCPU=3000 DefaultTime=00:30:00 MaxTime=00:30:00  State=UP QOS=bdw_fua_gwdbg DenyQos=bdw_qos_special AllowGroups=g2
|> > > PartitionName=bdw_fua_gw    Nodes=r040c03s0[1,2] Default=NO DefMemPerCPU=3000 DefaultTime=00:30:00 MaxTime=48:00:00  State=UP QOS=bdw_fua_gw    DenyQos=bdw_qos_special AllowGroups=g2
|> > > PartitionName=bdw_fua_gwg2  Nodes=r040c03s0[1,2] Default=NO DefMemPerCPU=3000 DefaultTime=00:30:00 MaxTime=168:00:00 State=UP QOS=bdw_fua_gwg2  DenyQos=bdw_qos_special AllowGroups=g2
|> > > 
|> > > So why does the UID-GID mapping take so long?
|> > > 
|> > > @John: we defined many partitions on the same nodes but in the
|> > > production cluster they will be more or less split across the 6K
|> > > nodes.
|> > > 
|> > > thank you very much
|> > > ale
|> > > 
|> > > ----- Original Message -----
|> > > > From: "John DeSantis" <desantis at usf.edu>
|> > > > To: "Matthieu Hautreux" <matthieu.hautreux at gmail.com>
|> > > > Cc: hpc-sysmgt-info at cineca.it, "Slurm User Community List"
|> > > > <slurm-users at lists.schedmd.com>, "Isabella Baccarelli"
|> > > > <i.baccarelli at cineca.it> Sent: Tuesday, January 16, 2018
|> > > > 8:20:20
|> > > > PM
|> > > > Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on
|> > > > send/recv operation
|> > > > 
|> > > > Matthieu,
|> > > > 
|> > > > > I would bet on something like LDAP requests taking too much
|> > > > > time
|> > > > > because of a missing sssd cache.
|> > > > 
|> > > > Good point!  It's easy to forget to check something as "simple"
|> > > > as
|> > > > user
|> > > > look-up when something is taking "too long".
|> > > > 
|> > > > John DeSantis
|> > > > 
|> > > > On Tue, 16 Jan 2018 19:13:06 +0100
|> > > > Matthieu Hautreux <matthieu.hautreux at gmail.com> wrote:
|> > > > 
|> > > > > Hi,
|> > > > > 
|> > > > > > For this kind of issue, one good thing to do is to get a
|> > > > > > backtrace of slurmctld during the slowdown. You should then
|> > > > > > easily identify the subcomponent responsible for the issue.
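|> > > > > > 
|> > > > > > (Something like the following is usually enough; it pauses the
|> > > > > > daemon only for a moment and assumes gdb plus slurm debuginfo
|> > > > > > are installed:
|> > > > > > 
|> > > > > >   gdb -p $(pidof slurmctld) -batch -ex 'thread apply all bt' > /tmp/slurmctld.bt
|> > > > > > )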
|> > > > > 
|> > > > > I would bet on something like LDAP requests taking too much
|> > > > > time
|> > > > > because of a missing sssd cache.
|> > > > > 
|> > > > > Regards
|> > > > > Matthieu
|> > > > > 
|> > > > > > On Jan 16, 2018 at 18:59, "John DeSantis" <desantis at usf.edu>
|> > > > > > wrote:
|> > > > > 
|> > > > > >
|> > > > > > Ciao Alessandro,
|> > > > > >
|> > > > > > > setting MessageTimeout to 20 didn't solve it :(
|> > > > > > >
|> > > > > > > > looking at the slurmctld logs I noticed many warnings like
|> > > > > > > > these
|> > > > > > >
|> > > > > > > > Jan 16 05:11:00 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from _slurm_rpc_dump_partitions: usec=42850604 began=05:10:17.289
|> > > > > > > > Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from load_part_uid_allow_list: usec=44861325 began=05:20:13.257
|> > > > > > > > Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from _slurmctld_background: usec=44861653 began=05:20:13.257
|> > > > > >
|> > > > > > And:
|> > > > > >
|> > > > > > > > 271 Note very large processing time from _slurm_rpc_dump_partitions:
|> > > > > > > >  67 Note very large processing time from load_part_uid_allow_list:
|> > > > > >
|> > > > > > > I believe these values are in microseconds, so an average of
|> > > > > > > about 44 seconds per call (e.g. usec=44861325 is ~44.9 s),
|> > > > > > > mostly related to partition information.  Given that our
|> > > > > > > configuration has this value set to 90 seconds, I'd again
|> > > > > > > recommend another adjustment, perhaps to 60 seconds.
|> > > > > >
|> > > > > > > I'm not sure if redefining your partitions will help, but you
|> > > > > > > do have several partitions which contain the same set of nodes
|> > > > > > > and could be condensed, decreasing the number of partitions.
|> > > > > > > For example, the partitions bdw_all_serial & bdw_all_rcm could
|> > > > > > > be consolidated into a single partition by:
|> > > > > > >
|> > > > > > > 1.)  Using AllowQOS=bdw_all_serial,bdw_all_rcm;
|> > > > > > > 2.)  Setting MaxTime to 04:00:00 and defining a MaxWall via
|> > > > > > > each QOS (since one partition has 04:00:00 and the other
|> > > > > > > 03:00:00); see the sketch after this list.
|> > > > > > >
|> > > > > > > The same could be done for the partitions
|> > > > > > > skl_fua_{prod,bprod,lprod} as well.
|> > > > > >
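|> > > > > > > Roughly (untested sketch; the partition name and node list are
|> > > > > > > placeholders, adjust to your site):
|> > > > > > >
|> > > > > > >   # slurm.conf -- one partition instead of two
|> > > > > > >   PartitionName=bdw_all Nodes=<bdw_nodes> Default=NO MaxTime=04:00:00 State=UP AllowQOS=bdw_all_serial,bdw_all_rcm
|> > > > > > >
|> > > > > > >   # per-QOS wall clock caps in slurmdbd
|> > > > > > >   sacctmgr modify qos name=bdw_all_serial set MaxWall=04:00:00
|> > > > > > >   sacctmgr modify qos name=bdw_all_rcm set MaxWall=03:00:00
|> > > > > > >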
|> > > > > > HTH,
|> > > > > > John DeSantis
|> > > > > >
|> > > > > >
|> > > > > > On Tue, 16 Jan 2018 11:22:44 +0100
|> > > > > > Alessandro Federico <a.federico at cineca.it> wrote:
|> > > > > >
|> > > > > > > Hi,
|> > > > > > >
|> > > > > > > setting MessageTimeout to 20 didn't solve it :(
|> > > > > > >
|> > > > > > > > looking at the slurmctld logs I noticed many warnings like
|> > > > > > > > these
|> > > > > > >
|> > > > > > > > Jan 16 05:11:00 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from _slurm_rpc_dump_partitions: usec=42850604 began=05:10:17.289
|> > > > > > > > Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from load_part_uid_allow_list: usec=44861325 began=05:20:13.257
|> > > > > > > > Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from _slurmctld_background: usec=44861653 began=05:20:13.257
|> > > > > > >
|> > > > > > > they are generated in many functions:
|> > > > > > >
|> > > > > > > > [root@r000u17l01 ~]# journalctl -u slurmctld --since='2018-01-16 00:00:00' | grep -oP 'Note very large processing time from \w+:' | sort | uniq -c
|> > > > > > > >     4 Note very large processing time from dump_all_job_state:
|> > > > > > > >    67 Note very large processing time from load_part_uid_allow_list:
|> > > > > > > >    67 Note very large processing time from _slurmctld_background:
|> > > > > > > >     7 Note very large processing time from _slurm_rpc_complete_batch_script:
|> > > > > > > >     4 Note very large processing time from _slurm_rpc_dump_jobs:
|> > > > > > > >     3 Note very large processing time from _slurm_rpc_dump_job_user:
|> > > > > > > >   271 Note very large processing time from _slurm_rpc_dump_partitions:
|> > > > > > > >     5 Note very large processing time from _slurm_rpc_epilog_complete:
|> > > > > > > >     1 Note very large processing time from _slurm_rpc_job_pack_alloc_info:
|> > > > > > > >     3 Note very large processing time from _slurm_rpc_step_complete:
|> > > > > > >
|> > > > > > > processing times are always around tens of seconds.
|> > > > > > >
|> > > > > > > I'm attaching sdiag output and slurm.conf.
|> > > > > > >
|> > > > > > > thanks
|> > > > > > > ale
|> > > > > > >
|> > > > > > > ----- Original Message -----
|> > > > > > > > From: "Trevor Cooper" <tcooper at sdsc.edu>
|> > > > > > > > To: "Slurm User Community List"
|> > > > > > > > <slurm-users at lists.schedmd.com>
|> > > > > > > > Sent: Tuesday, January 16, 2018 12:10:21 AM
|> > > > > > > > Subject: Re: [slurm-users] slurm 17.11.2: Socket timed
|> > > > > > > > out
|> > > > > > > > on send/recv operation
|> > > > > > > >
|> > > > > > > > Alessandro,
|> > > > > > > >
|> > > > > > > > You might want to consider tracking your Slurm
|> > > > > > > > scheduler
|> > > > > > > > diagnostics output with some type of time-series
|> > > > > > > > monitoring
|> > > > > > > > system. The time-based history has proven more helpful
|> > > > > > > > at
|> > > > > > > > times
|> > > > > > > > than log contents by themselves.
|> > > > > > > >
|> > > > > > > > See Giovanni Torres' post on setting this up...
|> > > > > > > >
|> > > > > > > >     http://giovannitorres.me/graphing-sdiag-with-graphite.html
|> > > > > > > >
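|> > > > > > > > > A minimal cron-able sketch of the same idea (host, port and
|> > > > > > > > > metric names are placeholders; assumes a Graphite/carbon
|> > > > > > > > > plaintext listener on 2003 and Debian-style netcat, use -N
|> > > > > > > > > instead of -q1 with OpenBSD nc):
|> > > > > > > > >
|> > > > > > > > >   #!/bin/bash
|> > > > > > > > >   now=$(date +%s)
|> > > > > > > > >   sdiag | awk -v now="$now" '
|> > > > > > > > >     /Server thread count:/ { print "slurm.sdiag.server_threads " $NF " " now }
|> > > > > > > > >     /Agent queue size:/    { print "slurm.sdiag.agent_queue "    $NF " " now }
|> > > > > > > > >   ' | nc -q1 graphite.example.org 2003
|> > > > > > > > >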
|> > > > > > > > -- Trevor
|> > > > > > > >
|> > > > > > > > > On Jan 15, 2018, at 4:33 AM, Alessandro Federico
|> > > > > > > > > <a.federico at cineca.it> wrote:
|> > > > > > > > >
|> > > > > > > > > Hi John
|> > > > > > > > >
|> > > > > > > > > thanks for the info.
|> > > > > > > > > > slurmctld doesn't report anything about the server
|> > > > > > > > > > thread count in the logs,
|> > > > > > > > > > and sdiag shows only 3 server threads.
|> > > > > > > > >
|> > > > > > > > > We changed the MessageTimeout value to 20.
|> > > > > > > > >
|> > > > > > > > > I'll let you know if it solves the problem.
|> > > > > > > > >
|> > > > > > > > > Thanks
|> > > > > > > > > ale
|> > > > > > > > >
|> > > > > > > > > ----- Original Message -----
|> > > > > > > > >> From: "John DeSantis" <desantis at usf.edu>
|> > > > > > > > >> To: "Alessandro Federico" <a.federico at cineca.it>
|> > > > > > > > >> Cc: slurm-users at lists.schedmd.com, "Isabella
|> > > > > > > > >> Baccarelli"
|> > > > > > > > >> <i.baccarelli at cineca.it>, hpc-sysmgt-info at cineca.it
|> > > > > > > > >> Sent: Friday, January 12, 2018 7:58:38 PM
|> > > > > > > > >> Subject: Re: [slurm-users] slurm 17.11.2: Socket
|> > > > > > > > >> timed
|> > > > > > > > >> out on
|> > > > > > > > >> send/recv operation
|> > > > > > > > >>
|> > > > > > > > >> Ciao Alessandro,
|> > > > > > > > >>
|> > > > > > > > >>> Do we have to apply any particular setting to avoid
|> > > > > > > > >>> incurring the problem?
|> > > > > > > > >>
|> > > > > > > > >> What is your "MessageTimeout" value in slurm.conf?
|> > > > > > > > >>  If
|> > > > > > > > >> it's
|> > > > > > > > >> at the default of 10, try changing it to 20.
|> > > > > > > > >>
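|> > > > > > > > > >> (A quick sketch for checking and changing it:
|> > > > > > > > > >>
|> > > > > > > > > >>   scontrol show config | grep -i MessageTimeout
|> > > > > > > > > >>   # then set MessageTimeout=20 in slurm.conf and
|> > > > > > > > > >>   # restart/reconfigure slurmctld
|> > > > > > > > > >> )
|> > > > > > > > > >>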
|> > > > > > > > >> I'd also check and see if the slurmctld log is
|> > > > > > > > >> reporting
|> > > > > > > > >> anything pertaining to the server thread count being
|> > > > > > > > >> over
|> > > > > > > > >> its limit.
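|> > > > > > > > > >>
|> > > > > > > > > >> (sdiag also prints the live count if you want to watch
|> > > > > > > > > >> it, e.g. `sdiag | grep -i 'server thread'`.)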
|> > > > > > > > >>
|> > > > > > > > >> HTH,
|> > > > > > > > >> John DeSantis
|> > > > > > > > >>
|> > > > > > > > >> On Fri, 12 Jan 2018 11:32:57 +0100
|> > > > > > > > >> Alessandro Federico <a.federico at cineca.it> wrote:
|> > > > > > > > >>
|> > > > > > > > >>> Hi all,
|> > > > > > > > >>>
|> > > > > > > > >>>
|> > > > > > > > >>> we are setting up SLURM 17.11.2 on a small test
|> > > > > > > > >>> cluster
|> > > > > > > > >>> of
|> > > > > > > > >>> about 100
|> > > > > > > > >>> nodes. Sometimes we get the error in the subject
|> > > > > > > > >>> when
|> > > > > > > > >>> running any SLURM command (e.g. sinfo, squeue,
|> > > > > > > > >>> scontrol
|> > > > > > > > >>> reconf, etc...)
|> > > > > > > > >>>
|> > > > > > > > >>>
|> > > > > > > > >>> Do we have to apply any particular setting to avoid
|> > > > > > > > >>> incurring the problem?
|> > > > > > > > >>>
|> > > > > > > > >>>
|> > > > > > > > >>> We found this bug report
|> > > > > > > > >>> https://bugs.schedmd.com/show_bug.cgi?id=4002 but
|> > > > > > > > >>> it
|> > > > > > > > >>> regards the previous SLURM version and we do not
|> > > > > > > > >>> set
|> > > > > > > > >>> debug3
|> > > > > > > > >>> on slurmctld.
|> > > > > > > > >>>
|> > > > > > > > >>>
|> > > > > > > > >>> thanks in advance
|> > > > > > > > >>> ale
|> > > > > > > > >>>
|> > > > > > > > >>
|> > > > > > > > >>
|> > > > > > > > >
|> > > > > > > > > --
|> > > > > > > > > Alessandro Federico
|> > > > > > > > > HPC System Management Group
|> > > > > > > > > System & Technology Department
|> > > > > > > > > CINECA www.cineca.it
|> > > > > > > > > Via dei Tizii 6, 00185 Rome - Italy
|> > > > > > > > > phone: +39 06 44486708
|> > > > > > > > >
|> > > > > > > > > All work and no play makes Jack a dull boy.
|> > > > > > > > > All work and no play makes Jack a dull boy.
|> > > > > > > > > All work and no play makes Jack...
|> > > > > > > > >
|> > > > > > > >
|> > > > > > > >
|> > > > > > > >
|> > > > > > >
|> > > > > >
|> > > > > >
|> > > > 
|> > > > 
|> > > > 
|> > > 
|> > 
|> > 
|> 
|> --
|> Alessandro Federico
|> HPC System Management Group
|> System & Technology Department
|> CINECA www.cineca.it
|> Via dei Tizii 6, 00185 Rome - Italy
|> phone: +39 06 44486708
|> 
|> All work and no play makes Jack a dull boy.
|> All work and no play makes Jack a dull boy.
|> All work and no play makes Jack...
|> 
|> 
| 
| --
| Alessandro Federico
| HPC System Management Group
| System & Technology Department
| CINECA www.cineca.it
| Via dei Tizii 6, 00185 Rome - Italy
| phone: +39 06 44486708
| 
| All work and no play makes Jack a dull boy.
| All work and no play makes Jack a dull boy.
| All work and no play makes Jack...

-- 
James A. Peltier
Manager - Research Computing Group | IT Services
Simon Fraser University | ASB10973
8888 University Dr., Burnaby, B.C. V5A 1S6
T: 778.782.6573 | M: 604.365.6432 | sfu.ca/itservices
Twitter: @sfu_it


