<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:Tahoma;
panose-1:2 11 6 4 3 5 4 4 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:#0563C1;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:#954F72;
text-decoration:underline;}
p.msonormal0, li.msonormal0, div.msonormal0
{mso-style-name:msonormal;
mso-margin-top-alt:auto;
margin-right:0in;
mso-margin-bottom-alt:auto;
margin-left:0in;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
span.EmailStyle18
{mso-style-type:personal;
font-family:"Tahoma",sans-serif;
color:windowtext;}
span.EmailStyle19
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style>
</head>
<body lang="EN-US" link="#0563C1" vlink="#954F72">
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:'Tahoma',sans-serif">Hi,<o:p></o:p></span></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Since configuring a backup Slurm controller (which included moving the StateSaveLocation from a local disk to an NFS share), we have been seeing this error in the slurmctld logs on a regular basis:<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><i>Socket timed out on send/recv operation<o:p></o:p></i></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:'Tahoma',sans-serif"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:'Tahoma',sans-serif">It sometimes occurs when a job array is started, and squeue then displays the error:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:'Tahoma',sans-serif"><o:p> </o:p></span></p>
<p class="MsoNormal" style="margin-left:.5in"><i><span style="font-size:10.0pt;font-family:'Tahoma',sans-serif">slurm_load_jobs error: Socket timed out on send/recv operation<o:p></o:p></span></i></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:'Tahoma',sans-serif"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:'Tahoma',sans-serif">We also see the following errors:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:'Tahoma',sans-serif"><o:p> </o:p></span></p>
<p class="MsoNormal" style="margin-left:.5in"><i><span style="font-size:10.0pt;font-family:'Tahoma',sans-serif">slurm_load_jobs error: Zero Bytes were transmitted or received<o:p></o:p></span></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i><span style="font-size:10.0pt;font-family:'Tahoma',sans-serif">srun: error: Unable to allocate resources: Zero Bytes were transmitted or received<o:p></o:p></span></i></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:'Tahoma',sans-serif"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:'Tahoma',sans-serif">The sdiag output is below. Does it show an abnormal number of RPCs issued by users? Are the REQUEST_JOB_INFO and REQUEST_NODE_INFO counts unusually high?<o:p></o:p></span></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><i>Server thread count: 3<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i>Agent queue size: 0<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i><o:p> </o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i>Jobs submitted: 14279<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i>Jobs started: 7709<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i>Jobs completed: 7001<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i>Jobs canceled: 38<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i>Jobs failed: 0<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i><o:p> </o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i>Main schedule statistics (microseconds):<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Last cycle: 788<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Max cycle: 461780<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Total cycles: 3319<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Mean cycle: 7589<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Mean depth cycle: 3<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Cycles per minute: 4<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Last queue length: 13<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i><o:p> </o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i>Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Total backfilled jobs (since last slurm start): 3204<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Total backfilled jobs (since last stats cycle start): 3160<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Total cycles: 436<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Last cycle when: Mon Jun 24 15:32:31 2019<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Last cycle: 253698<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Max cycle: 12701861<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Mean cycle: 338674<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Last depth cycle: 3<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Last depth cycle (try sched): 3<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Depth Mean: 15<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Depth Mean (try depth): 15<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Last queue length: 13<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> Queue length mean: 3<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i><o:p> </o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i>Remote Procedure Call statistics by message type<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_PARTITION_INFO ( 2009) count:468871 ave_time:2188 total_time:1026211593<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_NODE_INFO_SINGLE ( 2040) count:421773 ave_time:1775 total_time:748837928<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_JOB_INFO ( 2003) count:46877 ave_time:696 total_time:32627442<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_NODE_INFO ( 2007) count:43575 ave_time:1269 total_time:55301255<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_JOB_STEP_INFO ( 2005) count:38703 ave_time:201 total_time:7805655<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:29155 ave_time:758 total_time:22118507<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_JOB_USER_INFO ( 2039) count:22401 ave_time:391 total_time:8763503<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> MESSAGE_EPILOG_COMPLETE ( 6012) count:7484 ave_time:6164 total_time:46132632<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:7064 ave_time:79129 total_time:558971262<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_PING ( 1008) count:3561 ave_time:141 total_time:502289<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_STATS_INFO ( 2035) count:3236 ave_time:568 total_time:1838784<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_BUILD_INFO ( 2001) count:2598 ave_time:7869 total_time:20445066<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_SUBMIT_BATCH_JOB ( 4003) count:581 ave_time:132730 total_time:77116427<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_STEP_COMPLETE ( 5016) count:408 ave_time:4373 total_time:1784564<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_JOB_STEP_CREATE ( 5001) count:326 ave_time:14832 total_time:4835389<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_JOB_ALLOCATION_INFO_LITE ( 4016) count:302 ave_time:15754 total_time:4757813<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_JOB_READY ( 4019) count:78 ave_time:1615 total_time:125980<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_JOB_INFO_SINGLE ( 2021) count:48 ave_time:7851 total_time:376856<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_KILL_JOB ( 5032) count:38 ave_time:245 total_time:9346<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_RESOURCE_ALLOCATION ( 4001) count:28 ave_time:12730 total_time:356466<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:28 ave_time:20504 total_time:574137<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> REQUEST_CANCEL_JOB_STEP ( 5005) count:7 ave_time:43665 total_time:305661<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i><o:p> </o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i>Remote Procedure Call statistics by user<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> xxxxx ( 0) count:979383 ave_time:2500 total_time:2449350389<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> xxxxx ( 11160) count:116109 ave_time:695 total_time:80710478<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> xxxxx ( 11427) count:1264 ave_time:67572 total_time:85411027<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> xxxxx ( 11426) count:149 ave_time:7361 total_time:1096874<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> xxxxx ( 12818) count:136 ave_time:11354 total_time:1544190<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> xxxxx ( 12475) count:37 ave_time:4985 total_time:184452<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> xxxxx ( 12487) count:36 ave_time:30318 total_time:1091483<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> xxxxx ( 11147) count:12 ave_time:33489 total_time:401874<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> xxxxx ( 11345) count:6 ave_time:584 total_time:3508<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> xxxxx ( 12876) count:6 ave_time:483 total_time:2900<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i> xxxxx ( 11457) count:4 ave_time:345 total_time:1380<o:p></o:p></i></p>
<p class="MsoNormal" style="margin-left:.5in"><i><o:p> </o:p></i></p>
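<p class="MsoNormal">To make the RPC figures above easier to compare, here is a minimal sketch that parses a few of the "by message type" lines and reports each type's share of calls and of total time. The lines are copied from the sdiag output above; the names parse_rpc_stats, shares and summarize are hypothetical helpers of mine, not part of Slurm, and the percentages are relative to this pasted subset only.<o:p></o:p></p>

```python
import re

# A subset of the "by message type" lines, copied from the sdiag output above.
SDIAG_RPC_LINES = """\
REQUEST_PARTITION_INFO ( 2009) count:468871 ave_time:2188 total_time:1026211593
REQUEST_NODE_INFO_SINGLE ( 2040) count:421773 ave_time:1775 total_time:748837928
REQUEST_JOB_INFO ( 2003) count:46877 ave_time:696 total_time:32627442
REQUEST_NODE_INFO ( 2007) count:43575 ave_time:1269 total_time:55301255
REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:7064 ave_time:79129 total_time:558971262
REQUEST_SUBMIT_BATCH_JOB ( 4003) count:581 ave_time:132730 total_time:77116427
"""

# Matches: NAME ( ID) count:N ave_time:N total_time:N (times in microseconds)
RPC_LINE = re.compile(
    r"^\s*(\S+)\s*\(\s*(\d+)\)\s+count:(\d+)\s+ave_time:(\d+)\s+total_time:(\d+)\s*$"
)


def parse_rpc_stats(text):
    """Return a list of (name, count, ave_time_us, total_time_us) tuples."""
    rows = []
    for line in text.splitlines():
        m = RPC_LINE.match(line)
        if m:
            name, _rpc_id, count, ave, total = m.groups()
            rows.append((name, int(count), int(ave), int(total)))
    return rows


def shares(rows):
    """Map each RPC name to (percent of calls, percent of total time)."""
    total_calls = sum(count for _, count, _, _ in rows)
    total_time = sum(total for _, _, _, total in rows)
    return {
        name: (100.0 * count / total_calls, 100.0 * total / total_time)
        for name, count, _, total in rows
    }


def summarize(rows):
    """Print one line per RPC type, ordered by share of total time."""
    pct = shares(rows)
    for name in sorted(pct, key=lambda n: pct[n][1], reverse=True):
        calls_pct, time_pct = pct[name]
        print(f"{name:32s} calls {calls_pct:5.1f}%  time {time_pct:5.1f}%")


if __name__ == "__main__":
    summarize(parse_rpc_stats(SDIAG_RPC_LINES))
```

<p class="MsoNormal">Running it prints one line per RPC type, busiest first. The same parser should also work on the "by user" lines, since they use the same count/ave_time/total_time layout.<o:p></o:p></p>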
<p class="MsoNormal">Any suggestions or tips would be appreciated.<o:p></o:p></p>
<p class="MsoNormal">Rgds<o:p></o:p></p>
</div>
</body>
</html>