<p class="MsoNormal">Is there a way to diagnose if the I/O to the /cm/shared/apps/slurm/var/cm/statesave directory (Used for job status) on the NFS storage is the cause of the socket errors?<o:p></o:p></p>
<p class="MsoNormal">What values/threshold from the nfsiostat command would signal the NFS storage as the bottleneck?<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b>From:</b> Buckley, Ronan <br>
<b>Sent:</b> Tuesday, June 25, 2019 11:21 AM<br>
<b>To:</b> Slurm User Community List; slurm-users-bounces@lists.schedmd.com<br>
<b>Subject:</b> RE: [slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Hi,<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">I can reproduce the problem by submitting a job array of 700+.
<o:p></o:p></p>
<p class="MsoNormal">The slurmctld log file is also regularly outputting:<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">[2019-06-25T11:35:31.159] sched: 157 pending RPCs at cycle end, consider configuring max_rpc_cnt<o:p></o:p></p>
<p class="MsoNormal">[2019-06-25T11:35:43.007] sched: 193 pending RPCs at cycle end, consider configuring max_rpc_cnt<o:p></o:p></p>
<p class="MsoNormal">[2019-06-25T11:36:56.517] backfill: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt<o:p></o:p></p>
<p class="MsoNormal">[2019-06-25T11:37:29.620] backfill: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt<o:p></o:p></p>
<p class="MsoNormal">[2019-06-25T11:37:45.429] sched: 161 pending RPCs at cycle end, consider configuring max_rpc_cnt<o:p></o:p></p>
<p class="MsoNormal">[2019-06-25T11:38:00.472] backfill: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">The max_rpc_cnt is currently set to its default of zero. <o:p>
</o:p></p>
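
For reference, max_rpc_cnt is set through SchedulerParameters in slurm.conf; a minimal sketch (the value 150 is illustrative only, and any existing SchedulerParameters entries should be kept in the same comma-separated list):

    # slurm.conf on both controllers; defers scheduling cycles while pending RPCs exceed the limit
    SchedulerParameters=max_rpc_cnt=150

After editing, "scontrol reconfigure" should be enough to pick the change up without restarting slurmctld.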
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Rgds<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Ronan<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>

From: slurm-users <slurm-users-bounces@lists.schedmd.com> On Behalf Of Marcelo Garcia
Sent: Tuesday, June 25, 2019 10:35 AM
To: Slurm User Community List
Subject: Re: [slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<p><span style="color:#CE1126">[EXTERNAL EMAIL] <o:p></o:p></span></p>
</div>
<p class="MsoNormal"><span style="color:#1F497D">Hi <o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D">It seems a problem we discussed a few days ago:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"><a href="https://lists.schedmd.com/pipermail/slurm-users/2019-June/003524.html">https://lists.schedmd.com/pipermail/slurm-users/2019-June/003524.html</a><o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D">But in that thread I thinking we were using slurm with workflow managers. It's interesting that you have the problem after adding the second server and with NFS share. Do you have this problem randomly or it's
always happening on your jobs?<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D">I tried to get an idea how many RPCs would be OK, but I got no reply<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"><a href="https://lists.schedmd.com/pipermail/slurm-users/2019-June/003534.html">https://lists.schedmd.com/pipermail/slurm-users/2019-June/003534.html</a><o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D">My take is that there is no answer to the question, each site is different.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D">Best Regards<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D">mg.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>

From: slurm-users [mailto:slurm-users-bounces@lists.schedmd.com] On Behalf Of Buckley, Ronan
Sent: Tuesday, June 25, 2019 11:17
To: 'slurm-users@lists.schedmd.com' <slurm-users@lists.schedmd.com>; slurm-users-bounces@lists.schedmd.com
Subject: [slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma",sans-serif">Hi,<o:p></o:p></span></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Since configuring a backup slurm controller (including moving the StateSaveLocation from a local disk to a NFS share), we are seeing these errors in the slurmctld logs on a regular basis:<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><i>Socket timed out on send/recv operation<o:p></o:p></i></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma",sans-serif"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma",sans-serif">It sometimes occurs when a job array is started and squeue will display the error:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma",sans-serif"><o:p> </o:p></span></p>
<p class="MsoNormal" style="margin-left:.5in"><i><span style="font-size:10.0pt;font-family:"Tahoma",sans-serif">slurm_load_jobs error: Socket timed out on send/recv operation<o:p></o:p></span></i></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma",sans-serif"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma",sans-serif">We also see the following errors:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma",sans-serif"><o:p> </o:p></span></p>
<p class="MsoNormal" style="margin-left:.5in"><i><span style="font-size:10.0pt;font-family:"Tahoma",sans-serif">slurm_load_jobs error: Zero Bytes were transmitted or received<o:p></o:p></span></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i><span style="font-size:10.0pt;font-family:"Tahoma",sans-serif">srun: error: Unable to allocate resources: Zero Bytes were transmitted or received<o:p></o:p></span></i></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma",sans-serif"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma",sans-serif">sdiag output is below. Does it show an abnormal number of RPC calls by the users? Are the REQUEST_JOB_INFO and REQUEST_NODE_INFO counts very high?<o:p></o:p></span></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><i>Server thread count: 3<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i>Agent queue size: 0<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i><o:p> </o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i>Jobs submitted: 14279<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i>Jobs started: 7709<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i>Jobs completed: 7001<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i>Jobs canceled: 38<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i>Jobs failed: 0<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i><o:p> </o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i>Main schedule statistics (microseconds):<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Last cycle: 788<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Max cycle: 461780<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Total cycles: 3319<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Mean cycle: 7589<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Mean depth cycle: 3<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Cycles per minute: 4<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Last queue length: 13<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i><o:p> </o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i>Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Total backfilled jobs (since last slurm start): 3204<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Total backfilled jobs (since last stats cycle start): 3160<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Total cycles: 436<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Last cycle when: Mon Jun 24 15:32:31 2019<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Last cycle: 253698<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Max cycle: 12701861<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Mean cycle: 338674<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Last depth cycle: 3<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Last depth cycle (try sched): 3<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Depth Mean: 15<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Depth Mean (try depth): 15<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Last queue length: 13<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Queue length mean: 3<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i><o:p> </o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i>Remote Procedure Call statistics by message type<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_PARTITION_INFO ( 2009) count:468871 ave_time:2188 total_time:1026211593<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_NODE_INFO_SINGLE ( 2040) count:421773 ave_time:1775 total_time:748837928<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_JOB_INFO ( 2003) count:46877 ave_time:696 total_time:32627442<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_NODE_INFO ( 2007) count:43575 ave_time:1269 total_time:55301255<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_JOB_STEP_INFO ( 2005) count:38703 ave_time:201 total_time:7805655<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:29155 ave_time:758 total_time:22118507<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_JOB_USER_INFO ( 2039) count:22401 ave_time:391 total_time:8763503<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> MESSAGE_EPILOG_COMPLETE ( 6012) count:7484 ave_time:6164 total_time:46132632<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:7064 ave_time:79129 total_time:558971262<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_PING ( 1008) count:3561 ave_time:141 total_time:502289<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_STATS_INFO ( 2035) count:3236 ave_time:568 total_time:1838784<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_BUILD_INFO ( 2001) count:2598 ave_time:7869 total_time:20445066<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_SUBMIT_BATCH_JOB ( 4003) count:581 ave_time:132730 total_time:77116427<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_STEP_COMPLETE ( 5016) count:408 ave_time:4373 total_time:1784564<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_JOB_STEP_CREATE ( 5001) count:326 ave_time:14832 total_time:4835389<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_JOB_ALLOCATION_INFO_LITE ( 4016) count:302 ave_time:15754 total_time:4757813<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_JOB_READY ( 4019) count:78 ave_time:1615 total_time:125980<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_JOB_INFO_SINGLE ( 2021) count:48 ave_time:7851 total_time:376856<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_KILL_JOB ( 5032) count:38 ave_time:245 total_time:9346<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_RESOURCE_ALLOCATION ( 4001) count:28 ave_time:12730 total_time:356466<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:28 ave_time:20504 total_time:574137<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_CANCEL_JOB_STEP ( 5005) count:7 ave_time:43665 total_time:305661<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i><o:p> </o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i>Remote Procedure Call statistics by user<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> xxxxx ( 0) count:979383 ave_time:2500 total_time:2449350389<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> xxxxx ( 11160) count:116109 ave_time:695 total_time:80710478<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> xxxxx ( 11427) count:1264 ave_time:67572 total_time:85411027<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> xxxxx ( 11426) count:149 ave_time:7361 total_time:1096874<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> xxxxx ( 12818) count:136 ave_time:11354 total_time:1544190<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> xxxxx ( 12475) count:37 ave_time:4985 total_time:184452<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> xxxxx ( 12487) count:36 ave_time:30318 total_time:1091483<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> xxxxx ( 11147) count:12 ave_time:33489 total_time:401874<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> xxxxx ( 11345) count:6 ave_time:584 total_time:3508<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> xxxxx ( 12876) count:6 ave_time:483 total_time:2900<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> xxxxx ( 11457) count:4 ave_time:345 total_time:1380<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i><o:p> </o:p></i></p>
<p class="MsoNormal">Any suggestions/tips are helpful.<o:p></o:p></p>
<p class="MsoNormal">Rgds<o:p></o:p></p>
<p class="MsoNormal" style="margin-bottom:12.0pt"><span style="font-size:12.0pt;font-family:"Times New Roman",serif;background:white"><o:p> </o:p></span></p>
<p align="center" style="text-align:center"><span style="background:white">Click <a href="https://www.mailcontrol.com/sr/E3MG1ttEFmzGX2PQPOmvUrn00dwD0CtTR50NQzaa0Hzyu5oRJaiy8o4IRepqswOkHdrQZ5lrk5_gE3KctAewCA==">
here</a> to report this email as spam.<o:p></o:p></span></p>
</div>
</body>
</html>