<div dir="auto">Hi,<div dir="auto"><br></div><div dir="auto">In this kind if issues, one good thing to do is to get a backtrace of slurmctld during the slowdown. You should thus easily identify the subcomponent responsible for the issue.</div><div dir="auto"><br></div><div dir="auto">I would bet on something like LDAP requests taking too much time because of a missing sssd cache.</div><div dir="auto"><br></div><div dir="auto">Regards</div><div dir="auto">Matthieu</div></div><div class="gmail_extra"><br><div class="gmail_quote">Le 16 janv. 2018 18:59, "John DeSantis" <<a href="mailto:desantis@usf.edu">desantis@usf.edu</a>> a écrit :<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">-----BEGIN PGP SIGNED MESSAGE-----<br>
<div class="gmail_extra"><br><div class="gmail_quote">On 16 Jan 2018 at 18:59, "John DeSantis" <<a href="mailto:desantis@usf.edu">desantis@usf.edu</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Ciao Alessandro,<br>
<br>
> setting MessageTimeout to 20 didn't solve it :(<br>
><br>
> looking at the slurmctld logs I noticed many warnings like these:<br>
><br>
> Jan 16 05:11:00 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from _slurm_rpc_dump_partitions: usec=42850604 began=05:10:17.289<br>
> Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from load_part_uid_allow_list: usec=44861325 began=05:20:13.257<br>
> Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from _slurmctld_background: usec=44861653 began=05:20:13.257<br>
<br>
And:<br>
<br>
> 271 Note very large processing time from _slurm_rpc_dump_partitions:<br>
> 67 Note very large processing time from load_part_uid_allow_list:<br>
<br>
I believe these values are in microseconds, so that's an average of<br>
roughly 44 seconds per call, mostly related to partition information.<br>
Given that our own configuration has MessageTimeout set at its<br>
maximum value of 90 seconds, I'd again recommend another adjustment<br>
on your side, perhaps to 60 seconds.<br>
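<br>
In slurm.conf that is a one-line change (followed by restarting or<br>
reconfiguring slurmctld):<br>
<br>
MessageTimeout=60<br>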
<br>
I'm not sure if redefining your partitions will help, but you do have<br>
several partitions which contain the same set of nodes that could be<br>
condensed, reducing the number of partitions. For example, the<br>
partitions bdw_all_serial & bdw_all_rcm could be consolidated into a<br>
single partition (see the sketch below) by:<br>
<br>
1.) Using AllowQOS=bdw_all_serial,bdw_all_rcm;<br>
2.) Setting MaxTime to 04:00:00 and defining a MaxWall via each QOS<br>
(since one partition has 04:00:00 and the other 03:00:00).<br>
<br>
The same could be done for the partitions skl_fua_{prod,bprod,lprod} as<br>
well.<br>
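<br>
As a very rough sketch only - the node list below is a placeholder,<br>
the QOS must already exist in the Slurm database, and which QOS keeps<br>
which time limit is just illustrative:<br>
<br>
# slurm.conf: one consolidated partition instead of two<br>
PartitionName=bdw_all Nodes=node[001-100] MaxTime=04:00:00 AllowQos=bdw_all_serial,bdw_all_rcm State=UP<br>
<br>
# per-QOS walltime caps via sacctmgr<br>
sacctmgr modify qos bdw_all_serial set MaxWall=04:00:00<br>
sacctmgr modify qos bdw_all_rcm set MaxWall=03:00:00<br>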
<br>
HTH,<br>
John DeSantis<br>
<br>
<br>
On Tue, 16 Jan 2018 11:22:44 +0100<br>
Alessandro Federico <<a href="mailto:a.federico@cineca.it">a.federico@cineca.it</a>> wrote:<br>
<br>
> Hi,<br>
><br>
> setting MessageTimeout to 20 didn't solve it :(<br>
><br>
> looking at the slurmctld logs I noticed many warnings like these:<br>
><br>
> Jan 16 05:11:00 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from _slurm_rpc_dump_partitions: usec=42850604 began=05:10:17.289<br>
> Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from load_part_uid_allow_list: usec=44861325 began=05:20:13.257<br>
> Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from _slurmctld_background: usec=44861653 began=05:20:13.257<br>
><br>
> they are generated in many functions:<br>
><br>
> [root@r000u17l01 ~]# journalctl -u slurmctld --since='2018-01-16 00:00:00' | grep -oP 'Note very large processing time from \w+:' | sort | uniq -c<br>
>       4 Note very large processing time from dump_all_job_state:<br>
>      67 Note very large processing time from load_part_uid_allow_list:<br>
>      67 Note very large processing time from _slurmctld_background:<br>
>       7 Note very large processing time from _slurm_rpc_complete_batch_script:<br>
>       4 Note very large processing time from _slurm_rpc_dump_jobs:<br>
>       3 Note very large processing time from _slurm_rpc_dump_job_user:<br>
>     271 Note very large processing time from _slurm_rpc_dump_partitions:<br>
>       5 Note very large processing time from _slurm_rpc_epilog_complete:<br>
>       1 Note very large processing time from _slurm_rpc_job_pack_alloc_info:<br>
>       3 Note very large processing time from _slurm_rpc_step_complete:<br>
><br>
> processing times are always in the tens of seconds.<br>
><br>
> I'm attaching sdiag output and slurm.conf.<br>
><br>
> thanks<br>
> ale<br>
><br>
> ----- Original Message -----<br>
> > From: "Trevor Cooper" <<a href="mailto:tcooper@sdsc.edu">tcooper@sdsc.edu</a>><br>
> > To: "Slurm User Community List" <<a href="mailto:slurm-users@lists.schedmd.com">slurm-users@lists.schedmd.com</a><wbr>><br>
> > Sent: Tuesday, January 16, 2018 12:10:21 AM<br>
> > Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on<br>
> > send/recv operation<br>
> ><br>
> > Alessandro,<br>
> ><br>
> > You might want to consider tracking your Slurm scheduler diagnostics<br>
> > output with some type of time-series monitoring system. The<br>
> > time-based history has proven more helpful at times than log<br>
> > contents by themselves.<br>
> ><br>
> > See Giovanni Torres' post on setting this up...<br>
> ><br>
> > <a href="http://giovannitorres.me/graphing-sdiag-with-graphite.html" rel="noreferrer" target="_blank">http://giovannitorres.me/<wbr>graphing-sdiag-with-graphite.<wbr>html</a><br>
> ><br>
> > -- Trevor<br>
> ><br>
> > > On Jan 15, 2018, at 4:33 AM, Alessandro Federico<br>
> > > <<a href="mailto:a.federico@cineca.it">a.federico@cineca.it</a>> wrote:<br>
> > ><br>
> > > Hi John<br>
> > ><br>
> > > thanks for the info.<br>
> > > slurmctld doesn't report anything about the server thread count in<br>
> > > the logs<br>
> > > and sdiag shows only 3 server threads.<br>
> > ><br>
> > > We changed the MessageTimeout value to 20.<br>
> > ><br>
> > > I'll let you know if it solves the problem.<br>
> > ><br>
> > > Thanks<br>
> > > ale<br>
> > ><br>
> > > ----- Original Message -----<br>
> > >> From: "John DeSantis" <<a href="mailto:desantis@usf.edu">desantis@usf.edu</a>><br>
> > >> To: "Alessandro Federico" <<a href="mailto:a.federico@cineca.it">a.federico@cineca.it</a>><br>
> > >> Cc: <a href="mailto:slurm-users@lists.schedmd.com">slurm-users@lists.schedmd.com</a>, "Isabella Baccarelli"<br>
> > >> <<a href="mailto:i.baccarelli@cineca.it">i.baccarelli@cineca.it</a>>, <a href="mailto:hpc-sysmgt-info@cineca.it">hpc-sysmgt-info@cineca.it</a><br>
> > >> Sent: Friday, January 12, 2018 7:58:38 PM<br>
> > >> Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on<br>
> > >> send/recv operation<br>
> > >><br>
> > >> Ciao Alessandro,<br>
> > >><br>
> > >>> Do we have to apply any particular setting to avoid incurring<br>
> > >>> the problem?<br>
> > >><br>
> > >> What is your "MessageTimeout" value in slurm.conf? If it's at<br>
> > >> the default of 10, try changing it to 20.<br>
> > >><br>
> > >> I'd also check and see if the slurmctld log is reporting anything<br>
> > >> pertaining to the server thread count being over its limit.<br>
> > >><br>
> > >> HTH,<br>
> > >> John DeSantis<br>
> > >><br>
> > >> On Fri, 12 Jan 2018 11:32:57 +0100<br>
> > >> Alessandro Federico <<a href="mailto:a.federico@cineca.it">a.federico@cineca.it</a>> wrote:<br>
> > >><br>
> > >>> Hi all,<br>
> > >>><br>
> > >>><br>
> > >>> we are setting up SLURM 17.11.2 on a small test cluster of about<br>
> > >>> 100<br>
> > >>> nodes. Sometimes we get the error in the subject when running<br>
> > >>> any SLURM command (e.g. sinfo, squeue, scontrol reconf, etc...)<br>
> > >>><br>
> > >>><br>
> > >>> Do we have to apply any particular setting to avoid incurring<br>
> > >>> the problem?<br>
> > >>><br>
> > >>><br>
> > >>> We found this bug report<br>
> > >>> <a href="https://bugs.schedmd.com/show_bug.cgi?id=4002" rel="noreferrer" target="_blank">https://bugs.schedmd.com/show_<wbr>bug.cgi?id=4002</a> but it regards the<br>
> > >>> previous SLURM version and we do not set debug3 on slurmctld.<br>
> > >>><br>
> > >>><br>
> > >>> thanks in advance<br>
> > >>> ale<br>
> > >>><br>
> > >><br>
> > >><br>
> > ><br>
> > > --<br>
> > > Alessandro Federico<br>
> > > HPC System Management Group<br>
> > > System & Technology Department<br>
> > > CINECA <a href="http://www.cineca.it" rel="noreferrer" target="_blank">www.cineca.it</a><br>
> > > Via dei Tizii 6, 00185 Rome - Italy<br>
> > > phone: <a href="tel:%2B39%2006%2044486708" value="+390644486708">+39 06 44486708</a><br>
> > ><br>
> > > All work and no play makes Jack a dull boy.<br>
> > > All work and no play makes Jack a dull boy.<br>
> > > All work and no play makes Jack...<br>
> > ><br>
> ><br>
> ><br>
> ><br>
><br>
<br>
</blockquote></div></div>