[slurm-users] slurm 17.11.2: Socket timed out on send/recv operation

John DeSantis desantis at usf.edu
Wed Jan 17 07:30:43 MST 2018


Ale,

> As Matthieu said it seems something related to SSS daemon.

That was a great catch by Matthieu.

> Moreover, only 3 SLURM partitions have the AllowGroups ACL

Correct, which may seem negligible, but after each `scontrol
reconfigure`, slurmctld restart, and/or AllowGroups= partition update,
the mapping of UIDs for each group is refreshed.
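
As a quick sanity check, it can be worth timing those same lookups by
hand on the controller node; a rough sketch (the group and user names
below are simply the ones from your backtrace, adjust as needed):

  # time the group expansion the controller performs for AllowGroups
  time getent group g2
  # time a single member lookup, as uid_from_string() does per user
  time getent passwd g2bottin
  # show the ACLs currently applied to the partition
  scontrol show partition bdw_fua_gw | grep -i allow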

> So why does the UID-GID mapping take so long?

We attempted to use "AllowGroups" previously, but we found (even with
sssd.conf tuning regarding enumeration) that unless the group was local
(/etc/group), we were experiencing delays before the AllowGroups
parameter was respected.  This is why we opted to use SLURM's
AllowQOS/AllowAccounts instead.

You would have to enable debugging on your remote authentication
software to see where the hang-up is occurring (if it is that at all,
and not just a delay with the slurmctld).
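
With sssd, for example, that usually means raising debug_level in the
relevant [domain/...] section and watching the domain log while the
slowdown happens; a minimal sketch (domain name and log path are
placeholders for your own):

  # /etc/sssd/sssd.conf
  [domain/example.com]
  debug_level = 7

  # restart sssd, optionally flush the cache to reproduce the slow path,
  # then watch the domain log during an `scontrol reconfigure`
  systemctl restart sssd
  sss_cache -E
  tail -f /var/log/sssd/sssd_example.com.log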

Given the direction that this is going - why not replace the
"AllowGroups" with either a simple "AllowAccounts=" or "AllowQOS="?

> @John: we defined many partitions on the same nodes but in the
> production cluster they will be more or less split across the 6K
> nodes.

Ok, that makes sense.  Looking initially at your partition definitions,
I immediately thought of keeping them DRY, especially since the "finer"
tuning between partitions could easily be controlled via the QOSes
allowed to access the resources.
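
For the record, the consolidation suggested further down the thread for
bdw_all_serial and bdw_all_rcm would look roughly like this (sketch
only; the node list is a placeholder, and the per-QOS walltime is set
with sacctmgr):

  # one partition instead of two; the partition carries the longer limit
  PartitionName=bdw_all Nodes=<bdw node list> Default=NO DefMemPerCPU=3000 MaxTime=04:00:00 State=UP AllowQos=bdw_all_serial,bdw_all_rcm

  # cap whichever QOS backed the 03:00:00 partition via its MaxWall
  # (bdw_all_rcm shown for illustration)
  sacctmgr modify qos where name=bdw_all_rcm set MaxWall=03:00:00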

John DeSantis

On Wed, 17 Jan 2018 13:20:49 +0100
Alessandro Federico <a.federico at cineca.it> wrote:

> Hi Matthieu & John
> 
> this is the backtrace of slurmctld during the slowdown
> 
> (gdb) bt
> #0  0x00007fb0e8b1e69d in poll () from /lib64/libc.so.6
> #1  0x00007fb0e8617bfa in sss_cli_make_request_nochecks () from /lib64/libnss_sss.so.2
> #2  0x00007fb0e86185a3 in sss_nss_make_request () from /lib64/libnss_sss.so.2
> #3  0x00007fb0e8619104 in _nss_sss_getpwnam_r () from /lib64/libnss_sss.so.2
> #4  0x00007fb0e8aef07d in getpwnam_r@@GLIBC_2.2.5 () from /lib64/libc.so.6
> #5  0x00007fb0e9360986 in _getpwnam_r (result=<optimized out>, bufsiz=<optimized out>, buf=<optimized out>, pwd=<optimized out>, name=<optimized out>) at uid.c:73
> #6  uid_from_string (name=0x1820e41 "g2bottin", uidp=uidp@entry=0x7fff07f03a6c) at uid.c:111
> #7  0x000000000043587d in get_group_members (group_name=0x10ac500 "g2") at groups.c:139
> #8  0x000000000047525a in _get_groups_members (group_names=<optimized out>) at partition_mgr.c:2006
> #9  0x0000000000475505 in _update_part_uid_access_list (x=0x7fb03401e650, arg=0x7fff07f13bf4) at partition_mgr.c:1930
> #10 0x00007fb0e92ab675 in list_for_each (l=0x1763e50, f=f@entry=0x4754d8 <_update_part_uid_access_list>, arg=arg@entry=0x7fff07f13bf4) at list.c:420
> #11 0x000000000047911a in load_part_uid_allow_list (force=1) at partition_mgr.c:1971
> #12 0x0000000000428e5c in _slurmctld_background (no_data=0x0) at controller.c:1911
> #13 main (argc=<optimized out>, argv=<optimized out>) at controller.c:601
> 
> As Matthieu said it seems something related to SSS daemon.
> However, we don't notice any slowdown due to sssd in our environment.
> As I told you before, we are just testing SLURM on a small 100-node
> cluster before going into production with about 6000 nodes next
> Wednesday. At present the other nodes are managed by PBSPro, and the
> two PBS servers are running on the same nodes as the SLURM controllers.
> PBS queues are also configured with user/group ACLs and we have never
> noticed any similar slowdown.
> 
> Moreover, only 3 SLURM partitions have the AllowGroups ACL
> 
> [root at mgmt01 slurm]# grep AllowGroups slurm.conf 
> PartitionName=bdw_fua_gwdbg Nodes=r040c03s0[1,2] Default=NO
> DefMemPerCPU=3000 DefaultTime=00:30:00 MaxTime=00:30:00  State=UP
> QOS=bdw_fua_gwdbg DenyQos=bdw_qos_special AllowGroups=g2
> PartitionName=bdw_fua_gw    Nodes=r040c03s0[1,2] Default=NO
> DefMemPerCPU=3000 DefaultTime=00:30:00 MaxTime=48:00:00  State=UP
> QOS=bdw_fua_gw    DenyQos=bdw_qos_special AllowGroups=g2
> PartitionName=bdw_fua_gwg2  Nodes=r040c03s0[1,2] Default=NO
> DefMemPerCPU=3000 DefaultTime=00:30:00 MaxTime=168:00:00 State=UP
> QOS=bdw_fua_gwg2  DenyQos=bdw_qos_special AllowGroups=g2
> 
> So why does the UID-GID mapping take so long?
> 
> @John: we defined many partitions on the same nodes but in the
> production cluster they will be more or less split across the 6K
> nodes.
> 
> thank you very much
> ale
> 
> ----- Original Message -----
> > From: "John DeSantis" <desantis at usf.edu>
> > To: "Matthieu Hautreux" <matthieu.hautreux at gmail.com>
> > Cc: hpc-sysmgt-info at cineca.it, "Slurm User Community List"
> > <slurm-users at lists.schedmd.com>, "Isabella Baccarelli"
> > <i.baccarelli at cineca.it> Sent: Tuesday, January 16, 2018 8:20:20 PM
> > Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on
> > send/recv operation
> > 
> > Matthieu,
> > 
> > > I would bet on something like LDAP requests taking too much time
> > > because of a missing sssd cache.
> > 
> > Good point!  It's easy to forget to check something as "simple" as
> > user
> > look-up when something is taking "too long".
> > 
> > John DeSantis
> > 
> > On Tue, 16 Jan 2018 19:13:06 +0100
> > Matthieu Hautreux <matthieu.hautreux at gmail.com> wrote:
> > 
> > > Hi,
> > > 
> > > In this kind of issue, one good thing to do is to get a backtrace of
> > > slurmctld during the slowdown. You should then easily identify the
> > > subcomponent responsible for the issue.
> > > 
> > > I would bet on something like LDAP requests taking too much time
> > > because of a missing sssd cache.
> > > 
> > > Regards
> > > Matthieu
> > > 
> > > On 16 Jan 2018 at 18:59, "John DeSantis" <desantis at usf.edu> wrote:
> > > 
> > > >
> > > > Ciao Alessandro,
> > > >
> > > > > setting MessageTimeout to 20 didn't solve it :(
> > > > >
> > > > > looking at slurmctld logs I noticed many warning like these
> > > > >
> > > > > Jan 16 05:11:00 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from _slurm_rpc_dump_partitions: usec=42850604 began=05:10:17.289
> > > > > Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from load_part_uid_allow_list: usec=44861325 began=05:20:13.257
> > > > > Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from _slurmctld_background: usec=44861653 began=05:20:13.257
> > > >
> > > > And:
> > > >
> > > > >   271 Note very large processing time from _slurm_rpc_dump_partitions:
> > > > >    67 Note very large processing time from load_part_uid_allow_list:
> > > >
> > > > I believe these values are in microseconds, so an average of 44
> > > > seconds per call, mostly related to partition information.
> > > > Given that our configuration has the maximum value set of 90
> > > > seconds, I'd
> > > > again recommend another adjustment, perhaps to 60 seconds.
> > > >
> > > > I'm not sure if redefining your partitions will help, but you do
> > > > have several partitions which contain the same set of nodes and
> > > > could be condensed, decreasing the number of partitions.  For
> > > > example, the partitions bdw_all_serial & bdw_all_rcm could be
> > > > consolidated into a single partition by:
> > > >
> > > > 1.)  Using AllowQOS=bdw_all_serial,bdw_all_rcm;
> > > > 2.)  Setting MaxTime to 04:00:00 and defining a MaxWall via each
> > > > QOS
> > > > (since one partition has 04:00:00 and the other 03:00:00).
> > > >
> > > > The same could be done for the partitions
> > > > skl_fua_{prod,bprod,lprod} as well.
> > > >
> > > > HTH,
> > > > John DeSantis
> > > >
> > > >
> > > > On Tue, 16 Jan 2018 11:22:44 +0100
> > > > Alessandro Federico <a.federico at cineca.it> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > setting MessageTimeout to 20 didn't solve it :(
> > > > >
> > > > > looking at slurmctld logs I noticed many warning like these
> > > > >
> > > > > Jan 16 05:11:00 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from _slurm_rpc_dump_partitions: usec=42850604 began=05:10:17.289
> > > > > Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from load_part_uid_allow_list: usec=44861325 began=05:20:13.257
> > > > > Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from _slurmctld_background: usec=44861653 began=05:20:13.257
> > > > >
> > > > > they are generated in many functions:
> > > > >
> > > > > [root@r000u17l01 ~]# journalctl -u slurmctld --since='2018-01-16 00:00:00' | grep -oP 'Note very large processing time from \w+:' | sort | uniq -c
> > > > >       4 Note very large processing time from dump_all_job_state:
> > > > >      67 Note very large processing time from load_part_uid_allow_list:
> > > > >      67 Note very large processing time from _slurmctld_background:
> > > > >       7 Note very large processing time from _slurm_rpc_complete_batch_script:
> > > > >       4 Note very large processing time from _slurm_rpc_dump_jobs:
> > > > >       3 Note very large processing time from _slurm_rpc_dump_job_user:
> > > > >     271 Note very large processing time from _slurm_rpc_dump_partitions:
> > > > >       5 Note very large processing time from _slurm_rpc_epilog_complete:
> > > > >       1 Note very large processing time from _slurm_rpc_job_pack_alloc_info:
> > > > >       3 Note very large processing time from _slurm_rpc_step_complete:
> > > > >
> > > > > processing times are always around tens of seconds.
> > > > >
> > > > > I'm attaching sdiag output and slurm.conf.
> > > > >
> > > > > thanks
> > > > > ale
> > > > >
> > > > > ----- Original Message -----
> > > > > > From: "Trevor Cooper" <tcooper at sdsc.edu>
> > > > > > To: "Slurm User Community List"
> > > > > > <slurm-users at lists.schedmd.com>
> > > > > > Sent: Tuesday, January 16, 2018 12:10:21 AM
> > > > > > Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out
> > > > > > on send/recv operation
> > > > > >
> > > > > > Alessandro,
> > > > > >
> > > > > > You might want to consider tracking your Slurm scheduler
> > > > > > diagnostics output with some type of time-series monitoring
> > > > > > system. The time-based history has proven more helpful at
> > > > > > times
> > > > > > than log contents by themselves.
> > > > > >
> > > > > > See Giovanni Torres' post on setting this up...
> > > > > >
> > > > > >     http://giovannitorres.me/graphing-sdiag-with-graphite.html
> > > > > >
> > > > > > -- Trevor
> > > > > >
> > > > > > > On Jan 15, 2018, at 4:33 AM, Alessandro Federico
> > > > > > > <a.federico at cineca.it> wrote:
> > > > > > >
> > > > > > > Hi John
> > > > > > >
> > > > > > > thanks for the info.
> > > > > > > slurmctld doesn't report anything about the server thread
> > > > > > > count in the logs, and sdiag shows only 3 server threads.
> > > > > > >
> > > > > > > We changed the MessageTimeout value to 20.
> > > > > > >
> > > > > > > I'll let you know if it solves the problem.
> > > > > > >
> > > > > > > Thanks
> > > > > > > ale
> > > > > > >
> > > > > > > ----- Original Message -----
> > > > > > >> From: "John DeSantis" <desantis at usf.edu>
> > > > > > >> To: "Alessandro Federico" <a.federico at cineca.it>
> > > > > > >> Cc: slurm-users at lists.schedmd.com, "Isabella Baccarelli"
> > > > > > >> <i.baccarelli at cineca.it>, hpc-sysmgt-info at cineca.it
> > > > > > >> Sent: Friday, January 12, 2018 7:58:38 PM
> > > > > > >> Subject: Re: [slurm-users] slurm 17.11.2: Socket timed
> > > > > > >> out on
> > > > > > >> send/recv operation
> > > > > > >>
> > > > > > >> Ciao Alessandro,
> > > > > > >>
> > > > > > >>> Do we have to apply any particular setting to avoid
> > > > > > >>> incurring the problem?
> > > > > > >>
> > > > > > >> What is your "MessageTimeout" value in slurm.conf?  If
> > > > > > >> it's
> > > > > > >> at the default of 10, try changing it to 20.
> > > > > > >>
> > > > > > >> I'd also check and see if the slurmctld log is reporting
> > > > > > >> anything pertaining to the server thread count being over
> > > > > > >> its limit.
> > > > > > >>
> > > > > > >> HTH,
> > > > > > >> John DeSantis
> > > > > > >>
> > > > > > >> On Fri, 12 Jan 2018 11:32:57 +0100
> > > > > > >> Alessandro Federico <a.federico at cineca.it> wrote:
> > > > > > >>
> > > > > > >>> Hi all,
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> we are setting up SLURM 17.11.2 on a small test cluster
> > > > > > >>> of
> > > > > > >>> about 100
> > > > > > >>> nodes. Sometimes we get the error in the subject when
> > > > > > >>> running any SLURM command (e.g. sinfo, squeue, scontrol
> > > > > > >>> reconf, etc...)
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> Do we have to apply any particular setting to avoid
> > > > > > >>> incurring the problem?
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> We found this bug report
> > > > > > >>> https://bugs.schedmd.com/show_bug.cgi?id=4002 but it
> > > > > > >>> regards the previous SLURM version and we do not set
> > > > > > >>> debug3
> > > > > > >>> on slurmctld.
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> thanks in advance
> > > > > > >>> ale
> > > > > > >>>
> > > > > > >>
> > > > > > >>
> > > > > > >
> > > > > > > --
> > > > > > > Alessandro Federico
> > > > > > > HPC System Management Group
> > > > > > > System & Technology Department
> > > > > > > CINECA www.cineca.it
> > > > > > > Via dei Tizii 6, 00185 Rome - Italy
> > > > > > > phone: +39 06 44486708
> > > > > > >
> > > > > > > All work and no play makes Jack a dull boy.
> > > > > > > All work and no play makes Jack a dull boy.
> > > > > > > All work and no play makes Jack...
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > 
> > 
> > 
> 



