[slurm-users] Socket Timed Out on Send/Recv Operation
Yang Liu
yangliu at tamu.edu
Wed Apr 17 15:54:47 UTC 2019
We often receive errors caused by socket timeouts on send/recv operations:
slurm_load_jobs error: Socket timed out on send/recv operation
slurm_load_node: Socket timed out on send/recv operation
What could cause these errors? How likely is it that job_submit.lua causes them? We also have a program that runs every 2 seconds to collect information about pending jobs. Could that program be causing the errors?
Our Slurm version is 17.11.
Some extra debug information from slurmctl.log:
[2019-04-15T04:34:47.094] debug: Note large processing time from job_submit_plugin_submit: usec=1317325 began=04:34:45.777
[2019-04-15T04:34:47.098] debug: Note large processing time from _slurm_rpc_complete_prolog: usec=1240300 began=04:34:45.858
[2019-04-15T04:34:49.744] debug: Note large processing time from job_submit_plugin_submit: usec=1301871 began=04:34:48.442
[2019-04-15T04:34:56.541] debug: Note large processing time from job_submit_plugin_submit: usec=1258167 began=04:34:55.283
[2019-04-15T04:34:58.620] debug: Note large processing time from job_submit_plugin_submit: usec=1295753 began=04:34:57.324
[2019-04-15T04:34:58.823] debug: Note large processing time from _slurmctld_background: usec=1229287 began=04:34:57.581
[2019-04-15T04:35:00.013] debug: Note large processing time from job_submit_plugin_submit: usec=1252367 began=04:34:58.761
[2019-04-15T04:35:01.435] debug: Note large processing time from job_submit_plugin_submit: usec=1278561 began=04:35:00.156
[2019-04-15T04:35:02.843] debug: Note large processing time from job_submit_plugin_submit: usec=1263240 began=04:35:01.579
[2019-04-15T04:35:03.111] debug: Note large processing time from dump_all_job_state: usec=1108738 began=04:35:02.002
[2019-04-15T04:35:04.100] debug: Note large processing time from job_submit_plugin_submit: usec=1254256 began=04:35:02.846
[2019-04-15T04:35:05.335] debug: Note large processing time from job_submit_plugin_submit: usec=2488678 began=04:35:02.846
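To see which functions dominate these warnings, the log lines can be tallied per function. The following is a minimal sketch, assuming the lines keep exactly the format shown above; the function name summarize_slow_calls and the SAMPLE list are made up for illustration, and the sample lines are copied from the excerpt.

```python
import re
from collections import defaultdict

# Matches debug lines of the form shown in the log excerpt above.
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+debug:\s+Note large processing time from "
    r"(?P<func>\S+): usec=(?P<usec>\d+) began=(?P<began>\S+)"
)


def summarize_slow_calls(lines):
    """Return {function: (count, total_usec, max_usec)} for matching lines."""
    stats = defaultdict(lambda: [0, 0, 0])
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that is not a "large processing time" note
        usec = int(m.group("usec"))
        entry = stats[m.group("func")]
        entry[0] += 1
        entry[1] += usec
        entry[2] = max(entry[2], usec)
    return {func: tuple(values) for func, values in stats.items()}


# Three lines taken from the excerpt above (hypothetical variable name).
SAMPLE = [
    "[2019-04-15T04:34:47.094] debug: Note large processing time from "
    "job_submit_plugin_submit: usec=1317325 began=04:34:45.777",
    "[2019-04-15T04:34:47.098] debug: Note large processing time from "
    "_slurm_rpc_complete_prolog: usec=1240300 began=04:34:45.858",
    "[2019-04-15T04:35:05.335] debug: Note large processing time from "
    "job_submit_plugin_submit: usec=2488678 began=04:35:02.846",
]

# Print functions ordered by total time spent, largest first.
summary = summarize_slow_calls(SAMPLE)
for func, (count, total, worst) in sorted(
    summary.items(), key=lambda kv: kv[1][1], reverse=True
):
    print(f"{func}: count={count} total_usec={total} max_usec={worst}")
```

In practice the same function could be fed the lines of the full log file instead of SAMPLE.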
Output from sdiag:
*******************************************************
sdiag output at Wed Apr 17 09:15:35 2019 (1555510535)
Data since Tue Apr 16 19:00:00 2019 (1555459200)
*******************************************************
Server thread count: 3
Agent queue size: 0
DBD Agent queue size: 0
Jobs submitted: 4907
Jobs started: 4900
Jobs completed: 4910
Jobs canceled: 28
Jobs failed: 0
Jobs running: 377
Jobs running ts: Wed Apr 17 09:15:13 2019 (1555510513)
Main schedule statistics (microseconds):
Last cycle: 2177
Max cycle: 289836
Total cycles: 9167
Mean cycle: 5660
Mean depth cycle: 32
Cycles per minute: 10
Last queue length: 27
Backfilling stats
Total backfilled jobs (since last slurm start): 133491
Total backfilled jobs (since last stats cycle start): 984
Total backfilled heterogeneous job components: 0
Total cycles: 1691
Last cycle when: Wed Apr 17 09:15:12 2019 (1555510512)
Last cycle: 51703
Max cycle: 699037
Mean cycle: 85826
Last depth cycle: 27
Last depth cycle (try sched): 27
Depth Mean: 33
Depth Mean (try depth): 33
Last queue length: 27
Queue length mean: 31
Remote Procedure Call statistics by message type
REQUEST_JOB_INFO ( 2003) count:1826319 ave_time:25114 total_time:45866876387
REQUEST_PARTITION_INFO ( 2009) count:1290235 ave_time:401 total_time:518371152
REQUEST_FED_INFO ( 2049) count:1052360 ave_time:401 total_time:422773504
MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:868778 ave_time:37473 total_time:32555814249
REQUEST_JOB_USER_INFO ( 2039) count:704905 ave_time:22712 total_time:16010454603
REQUEST_JOB_INFO_SINGLE ( 2021) count:473505 ave_time:68461 total_time:32417060555
REQUEST_COMPLETE_PROLOG ( 6018) count:406364 ave_time:438771 total_time:178301089558
MESSAGE_EPILOG_COMPLETE ( 6012) count:405918 ave_time:237988 total_time:96603820959
REQUEST_STEP_COMPLETE ( 5016) count:403717 ave_time:215119 total_time:86847579110
REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:366802 ave_time:368023 total_time:134991874450
REQUEST_SUBMIT_BATCH_JOB ( 4003) count:318304 ave_time:2022617 total_time:643807267441
REQUEST_NODE_INFO ( 2007) count:67568 ave_time:38776 total_time:2620081867
REQUEST_PING ( 1008) count:55100 ave_time:293 total_time:16159038
REQUEST_JOB_STEP_CREATE ( 5001) count:37243 ave_time:18476 total_time:688122372
REQUEST_JOB_PACK_ALLOC_INFO ( 4027) count:36398 ave_time:36577 total_time:1331341719
REQUEST_KILL_JOB ( 5032) count:7821 ave_time:50206 total_time:392666200
REQUEST_CANCEL_JOB_STEP ( 5005) count:2342 ave_time:12425 total_time:29100941
REQUEST_BUILD_INFO ( 2001) count:574 ave_time:28934 total_time:16608216
REQUEST_JOB_NOTIFY ( 4022) count:547 ave_time:17239 total_time:9429961
ACCOUNTING_UPDATE_MSG (10001) count:457 ave_time:7636182 total_time:3489735519
REQUEST_NODE_INFO_SINGLE ( 2040) count:75 ave_time:281 total_time:21132
REQUEST_UPDATE_JOB ( 3001) count:21 ave_time:98102 total_time:2060154
REQUEST_UPDATE_PARTITION ( 3005) count:11 ave_time:748 total_time:8231
REQUEST_RESOURCE_ALLOCATION ( 4001) count:11 ave_time:470757 total_time:5178336
REQUEST_BATCH_SCRIPT ( 2051) count:10 ave_time:73539 total_time:735394
REQUEST_JOB_READY ( 4019) count:9 ave_time:305 total_time:2753
REQUEST_RESERVATION_INFO ( 2024) count:9 ave_time:324 total_time:2918
REQUEST_UPDATE_NODE ( 3002) count:8 ave_time:90478 total_time:723827
REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:4 ave_time:1225 total_time:4901
REQUEST_SHARE_INFO ( 2022) count:4 ave_time:45136 total_time:180547
REQUEST_CREATE_RESERVATION ( 3006) count:3 ave_time:3740 total_time:11220
REQUEST_UPDATE_RESERVATION ( 3009) count:3 ave_time:4174 total_time:12523
REQUEST_DELETE_RESERVATION ( 3008) count:3 ave_time:507 total_time:1523
REQUEST_JOB_WILL_RUN ( 4012) count:2 ave_time:1337100 total_time:2674200
REQUEST_JOB_STEP_INFO ( 2005) count:1 ave_time:374 total_time:374
REQUEST_STATS_INFO ( 2035) count:1 ave_time:280 total_time:280
REQUEST_TRIGGER_PULL ( 2030) count:1 ave_time:625 total_time:625
Remote Procedure Call statistics by user
root ( 0) count:2453662 ave_time:215740 total_time:529354938803
covenant07 ( 15246) count:2059328 ave_time:112083 total_time:230816339235
slurm ( 280) count:1976634 ave_time:25383 total_time:50174519509
zshang ( 17246) count:182449 ave_time:6807 total_time:1241946262
shaowen.mao ( 19650) count:162364 ave_time:18302 total_time:2971624851
cpattison ( 17344) count:101999 ave_time:19912 total_time:2031065227
parisamir ( 19240) count:95721 ave_time:292960 total_time:28042501476
francis ( 3112) count:74287 ave_time:19762 total_time:1468067338
djimenez ( 17823) count:37358 ave_time:266408 total_time:9952490492
yangyxjtu ( 18539) count:28396 ave_time:8983 total_time:255107082
rbovio ( 18281) count:22860 ave_time:8506720 total_time:194463621227
jmq811 ( 16240) count:21898 ave_time:70610 total_time:1546234039
stewart1983 ( 15971) count:15691 ave_time:45062 total_time:707071202
anjanatalapatra ( 14172) count:10720 ave_time:65625 total_time:703504099
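The RPC tables above are easier to read when ranked by total_time. The following is a minimal sketch, assuming each row has the "NAME ( ID) count:N ave_time:N total_time:N" layout shown above; the function name rank_rpcs and the SAMPLE_ROWS list are made up for illustration, and the rows are copied from the message-type table. The same pattern also matches the per-user rows.

```python
import re

# Matches one row of the sdiag RPC tables shown above, e.g.
# "REQUEST_JOB_INFO ( 2003) count:1826319 ave_time:25114 total_time:45866876387"
ROW_RE = re.compile(
    r"^(?P<name>\S+)\s*\(\s*(?P<id>\d+)\)\s+count:(?P<count>\d+)\s+"
    r"ave_time:(?P<ave_time>\d+)\s+total_time:(?P<total_time>\d+)"
)


def rank_rpcs(lines):
    """Parse RPC rows and return them sorted by total_time, largest first."""
    rows = []
    for line in lines:
        m = ROW_RE.match(line.strip())
        if not m:
            continue  # ignore headers and any other non-row lines
        rows.append(
            {
                "name": m.group("name"),
                "id": int(m.group("id")),
                "count": int(m.group("count")),
                "ave_time": int(m.group("ave_time")),
                "total_time": int(m.group("total_time")),
            }
        )
    return sorted(rows, key=lambda row: row["total_time"], reverse=True)


# Three rows taken from the message-type table above (hypothetical name).
SAMPLE_ROWS = [
    "REQUEST_JOB_INFO ( 2003) count:1826319 ave_time:25114 total_time:45866876387",
    "REQUEST_COMPLETE_PROLOG ( 6018) count:406364 ave_time:438771 total_time:178301089558",
    "REQUEST_SUBMIT_BATCH_JOB ( 4003) count:318304 ave_time:2022617 total_time:643807267441",
]

for row in rank_rpcs(SAMPLE_ROWS):
    print(f"{row['name']}: count={row['count']} "
          f"ave_time={row['ave_time']} total_time={row['total_time']}")
```

Piping the full sdiag output through rank_rpcs would list the message types and users that account for the most processing time.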
Thanks.
Yang Liu, Ph.D.
Associate Research Scientist
High Performance Research Computing
Division of Research
Texas A&M University