<div dir="auto">Hi,<div dir="auto"><br></div><div dir="auto">In this kind if issues, one good thing to do is to get a backtrace of slurmctld during the slowdown. You should thus easily identify the subcomponent responsible for the issue.</div><div dir="auto"><br></div><div dir="auto">I would bet on something like LDAP requests taking too much time because of a missing sssd cache.</div><div dir="auto"><br></div><div dir="auto">Regards</div><div dir="auto">Matthieu</div></div><div class="gmail_extra"><br><div class="gmail_quote">Le 16 janv. 2018 18:59, "John DeSantis" <<a href="mailto:desantis@usf.edu">desantis@usf.edu</a>> a écrit :<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">-----BEGIN PGP SIGNED MESSAGE-----<br>
<div class="gmail_extra"><br><div class="gmail_quote">On 16 Jan 2018 at 18:59, "John DeSantis" <<a href="mailto:desantis@usf.edu">desantis@usf.edu</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Ciao Alessandro,<br>
<br>
> setting MessageTimeout to 20 didn't solve it :(<br>
><br>
> looking at the slurmctld logs I noticed many warnings like these:<br>
><br>
> Jan 16 05:11:00 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from _slurm_rpc_dump_partitions: usec=42850604 began=05:10:17.289<br>
> Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from load_part_uid_allow_list: usec=44861325 began=05:20:13.257<br>
> Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from _slurmctld_background: usec=44861653 began=05:20:13.257<br>
<br>
And:<br>
<br>
> 271 Note very large processing time from _slurm_rpc_dump_partitions:<br>
> 67 Note very large processing time from load_part_uid_allow_list:<br>
<br>
I believe these values are in microseconds, so that's an average of<br>
roughly 44 seconds per call, mostly related to partition information.<br>
Given that our own configuration has MessageTimeout set at its<br>
maximum value of 90 seconds, I'd again recommend another adjustment<br>
on your side, perhaps to 60 seconds.<br>
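<br>
In slurm.conf that is a one-line change (followed by restarting or<br>
reconfiguring slurmctld):<br>
<br>
MessageTimeout=60<br>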
<br>
I'm not sure if redefining your partitions will help, but you do have<br>
several partitions which contain the same set of nodes that could be<br>
condensed, reducing the number of partitions. For example, the<br>
partitions bdw_all_serial & bdw_all_rcm could be consolidated into a<br>
single partition (see the sketch below) by:<br>
<br>
1.) Using AllowQOS=bdw_all_serial,bdw_all_rcm;<br>
2.) Setting MaxTime to 04:00:00 and defining a MaxWall via each QOS<br>
(since one partition has 04:00:00 and the other 03:00:00).<br>
<br>
The same could be done for the partitions skl_fua_{prod,bprod,lprod} as<br>
well.<br>
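<br>
As a very rough sketch only - the node list below is a placeholder,<br>
the QOS must already exist in the Slurm database, and which QOS keeps<br>
which time limit is just illustrative:<br>
<br>
# slurm.conf: one consolidated partition instead of two<br>
PartitionName=bdw_all Nodes=node[001-100] MaxTime=04:00:00 AllowQos=bdw_all_serial,bdw_all_rcm State=UP<br>
<br>
# per-QOS walltime caps via sacctmgr<br>
sacctmgr modify qos bdw_all_serial set MaxWall=04:00:00<br>
sacctmgr modify qos bdw_all_rcm set MaxWall=03:00:00<br>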
<br>
HTH,<br>
John DeSantis<br>
<br>
<br>
On Tue, 16 Jan 2018 11:22:44 +0100<br>
Alessandro Federico <<a href="mailto:a.federico@cineca.it">a.federico@cineca.it</a>> wrote:<br>
<br>
> Hi,<br>
><br>
> setting MessageTimeout to 20 didn't solve it :(<br>
><br>
> looking at the slurmctld logs I noticed many warnings like these:<br>
><br>
> Jan 16 05:11:00 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from _slurm_rpc_dump_partitions: usec=42850604 began=05:10:17.289<br>
> Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from load_part_uid_allow_list: usec=44861325 began=05:20:13.257<br>
> Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from _slurmctld_background: usec=44861653 began=05:20:13.257<br>
><br>
> they are generated in many functions:<br>
><br>
> [root@r000u17l01 ~]# journalctl -u slurmctld --since='2018-01-16 00:00:00' | grep -oP 'Note very large processing time from \w+:' | sort | uniq -c<br>
>       4 Note very large processing time from dump_all_job_state:<br>
>      67 Note very large processing time from load_part_uid_allow_list:<br>
>      67 Note very large processing time from _slurmctld_background:<br>
>       7 Note very large processing time from _slurm_rpc_complete_batch_script:<br>
>       4 Note very large processing time from _slurm_rpc_dump_jobs:<br>
>       3 Note very large processing time from _slurm_rpc_dump_job_user:<br>
>     271 Note very large processing time from _slurm_rpc_dump_partitions:<br>
>       5 Note very large processing time from _slurm_rpc_epilog_complete:<br>
>       1 Note very large processing time from _slurm_rpc_job_pack_alloc_info:<br>
>       3 Note very large processing time from _slurm_rpc_step_complete:<br>
><br>
> processing times are always in the tens of seconds.<br>
><br>
> I'm attaching sdiag output and slurm.conf.<br>
><br>
> thanks<br>
> ale<br>
><br>
> ----- Original Message -----<br>
> > From: "Trevor Cooper" <<a href="mailto:tcooper@sdsc.edu">tcooper@sdsc.edu</a>><br>
> > To: "Slurm User Community List" <<a href="mailto:slurm-users@lists.schedmd.com">slurm-users@lists.schedmd.com</a><wbr>><br>
> > Sent: Tuesday, January 16, 2018 12:10:21 AM<br>
> > Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on<br>
> > send/recv operation<br>
> ><br>
> > Alessandro,<br>
> ><br>
> > You might want to consider tracking your Slurm scheduler diagnostics<br>
> > output with some type of time-series monitoring system. The<br>
> > time-based history has proven more helpful at times than log<br>
> > contents by themselves.<br>
> ><br>
> > See Giovanni Torres' post on setting this up...<br>
> ><br>
> > <a href="http://giovannitorres.me/graphing-sdiag-with-graphite.html" rel="noreferrer" target="_blank">http://giovannitorres.me/<wbr>graphing-sdiag-with-graphite.<wbr>html</a><br>
> ><br>
> > -- Trevor<br>
> ><br>
> > > On Jan 15, 2018, at 4:33 AM, Alessandro Federico<br>
> > > <<a href="mailto:a.federico@cineca.it">a.federico@cineca.it</a>> wrote:<br>
> > ><br>
> > > Hi John<br>
> > ><br>
> > > thanks for the info.<br>
> > > slurmctld doesn't report anything about the server thread count in<br>
> > > the logs<br>
> > > and sdiag shows only 3 server threads.<br>
> > ><br>
> > > We changed the MessageTimeout value to 20.<br>
> > ><br>
> > > I'll let you know if it solves the problem.<br>
> > ><br>
> > > Thanks<br>
> > > ale<br>
> > ><br>
> > > ----- Original Message -----<br>
> > >> From: "John DeSantis" <<a href="mailto:desantis@usf.edu">desantis@usf.edu</a>><br>
> > >> To: "Alessandro Federico" <<a href="mailto:a.federico@cineca.it">a.federico@cineca.it</a>><br>
> > >> Cc: <a href="mailto:slurm-users@lists.schedmd.com">slurm-users@lists.schedmd.com</a>, "Isabella Baccarelli"<br>
> > >> <<a href="mailto:i.baccarelli@cineca.it">i.baccarelli@cineca.it</a>>, <a href="mailto:hpc-sysmgt-info@cineca.it">hpc-sysmgt-info@cineca.it</a><br>
> > >> Sent: Friday, January 12, 2018 7:58:38 PM<br>
> > >> Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on<br>
> > >> send/recv operation<br>
> > >><br>
> > >> Ciao Alessandro,<br>
> > >><br>
> > >>> Do we have to apply any particular setting to avoid incurring<br>
> > >>> the problem?<br>
> > >><br>
> > >> What is your "MessageTimeout" value in slurm.conf? If it's at<br>
> > >> the default of 10, try changing it to 20.<br>
> > >><br>
> > >> I'd also check and see if the slurmctld log is reporting anything<br>
> > >> pertaining to the server thread count being over its limit.<br>
> > >><br>
> > >> HTH,<br>
> > >> John DeSantis<br>
> > >><br>
> > >> On Fri, 12 Jan 2018 11:32:57 +0100<br>
> > >> Alessandro Federico <<a href="mailto:a.federico@cineca.it">a.federico@cineca.it</a>> wrote:<br>
> > >><br>
> > >>> Hi all,<br>
> > >>><br>
> > >>><br>
> > >>> we are setting up SLURM 17.11.2 on a small test cluster of about<br>
> > >>> 100<br>
> > >>> nodes. Sometimes we get the error in the subject when running<br>
> > >>> any SLURM command (e.g. sinfo, squeue, scontrol reconf, etc...)<br>
> > >>><br>
> > >>><br>
> > >>> Do we have to apply any particular setting to avoid incurring<br>
> > >>> the problem?<br>
> > >>><br>
> > >>><br>
> > >>> We found this bug report<br>
> > >>> <a href="https://bugs.schedmd.com/show_bug.cgi?id=4002" rel="noreferrer" target="_blank">https://bugs.schedmd.com/show_<wbr>bug.cgi?id=4002</a> but it regards the<br>
> > >>> previous SLURM version and we do not set debug3 on slurmctld.<br>
> > >>><br>
> > >>><br>
> > >>> thanks in advance<br>
> > >>> ale<br>
> > >>><br>
> > >><br>
> > >><br>
> > ><br>
> > > --<br>
> > > Alessandro Federico<br>
> > > HPC System Management Group<br>
> > > System & Technology Department<br>
> > > CINECA <a href="http://www.cineca.it" rel="noreferrer" target="_blank">www.cineca.it</a><br>
> > > Via dei Tizii 6, 00185 Rome - Italy<br>
> > > phone: <a href="tel:%2B39%2006%2044486708" value="+390644486708">+39 06 44486708</a><br>
> > ><br>
> > > All work and no play makes Jack a dull boy.<br>
> > > All work and no play makes Jack a dull boy.<br>
> > > All work and no play makes Jack...<br>
> > ><br>
> ><br>
> ><br>
> ><br>
><br>
<br>
</blockquote></div></div>