[slurm-users] Socket Timed Out on Send/Recv Operation

Yang Liu yangliu at tamu.edu
Wed Apr 17 15:54:47 UTC 2019


We often receive errors caused by socket timeouts on send/recv operations:

slurm_load_jobs error: Socket timed out on send/recv operation
slurm_load_node: Socket timed out on send/recv operation


What could cause these errors? How likely is it that job_submit.lua causes them? We also have a program that runs every 2 seconds to collect information about pending jobs. Could that program cause the errors?


Our Slurm version is 17.11.

Some extra debug information from slurmctl.log:

[2019-04-15T04:34:47.094] debug:  Note large processing time from job_submit_plugin_submit: usec=1317325 began=04:34:45.777
[2019-04-15T04:34:47.098] debug:  Note large processing time from _slurm_rpc_complete_prolog: usec=1240300 began=04:34:45.858
[2019-04-15T04:34:49.744] debug:  Note large processing time from job_submit_plugin_submit: usec=1301871 began=04:34:48.442
[2019-04-15T04:34:56.541] debug:  Note large processing time from job_submit_plugin_submit: usec=1258167 began=04:34:55.283
[2019-04-15T04:34:58.620] debug:  Note large processing time from job_submit_plugin_submit: usec=1295753 began=04:34:57.324
[2019-04-15T04:34:58.823] debug:  Note large processing time from _slurmctld_background: usec=1229287 began=04:34:57.581
[2019-04-15T04:35:00.013] debug:  Note large processing time from job_submit_plugin_submit: usec=1252367 began=04:34:58.761
[2019-04-15T04:35:01.435] debug:  Note large processing time from job_submit_plugin_submit: usec=1278561 began=04:35:00.156
[2019-04-15T04:35:02.843] debug:  Note large processing time from job_submit_plugin_submit: usec=1263240 began=04:35:01.579
[2019-04-15T04:35:03.111] debug:  Note large processing time from dump_all_job_state: usec=1108738 began=04:35:02.002
[2019-04-15T04:35:04.100] debug:  Note large processing time from job_submit_plugin_submit: usec=1254256 began=04:35:02.846
[2019-04-15T04:35:05.335] debug:  Note large processing time from job_submit_plugin_submit: usec=2488678 began=04:35:02.846



Output from sdiag:

*******************************************************
sdiag output at Wed Apr 17 09:15:35 2019 (1555510535)
Data since      Tue Apr 16 19:00:00 2019 (1555459200)
*******************************************************
Server thread count:  3
Agent queue size:     0
DBD Agent queue size: 0
Jobs submitted: 4907
Jobs started:   4900
Jobs completed: 4910
Jobs canceled:  28
Jobs failed:    0
Jobs running:    377
Jobs running ts: Wed Apr 17 09:15:13 2019 (1555510513)

Main schedule statistics (microseconds):
    Last cycle:   2177
    Max cycle:    289836
    Total cycles: 9167
    Mean cycle:   5660
    Mean depth cycle:  32
    Cycles per minute: 10
    Last queue length: 27

Backfilling stats
    Total backfilled jobs (since last slurm start): 133491
    Total backfilled jobs (since last stats cycle start): 984
    Total backfilled heterogeneous job components: 0
    Total cycles: 1691
    Last cycle when: Wed Apr 17 09:15:12 2019 (1555510512)
    Last cycle: 51703
    Max cycle:  699037
    Mean cycle: 85826
    Last depth cycle: 27
    Last depth cycle (try sched): 27
    Depth Mean: 33
    Depth Mean (try depth): 33
    Last queue length: 27
    Queue length mean: 31

Remote Procedure Call statistics by message type
    REQUEST_JOB_INFO                        ( 2003) count:1826319 ave_time:25114  total_time:45866876387
    REQUEST_PARTITION_INFO                  ( 2009) count:1290235 ave_time:401    total_time:518371152
    REQUEST_FED_INFO                        ( 2049) count:1052360 ave_time:401    total_time:422773504
    MESSAGE_NODE_REGISTRATION_STATUS        ( 1002) count:868778 ave_time:37473  total_time:32555814249
    REQUEST_JOB_USER_INFO                   ( 2039) count:704905 ave_time:22712  total_time:16010454603
    REQUEST_JOB_INFO_SINGLE                 ( 2021) count:473505 ave_time:68461  total_time:32417060555
    REQUEST_COMPLETE_PROLOG                 ( 6018) count:406364 ave_time:438771 total_time:178301089558
    MESSAGE_EPILOG_COMPLETE                 ( 6012) count:405918 ave_time:237988 total_time:96603820959
    REQUEST_STEP_COMPLETE                   ( 5016) count:403717 ave_time:215119 total_time:86847579110
    REQUEST_COMPLETE_BATCH_SCRIPT           ( 5018) count:366802 ave_time:368023 total_time:134991874450
    REQUEST_SUBMIT_BATCH_JOB                ( 4003) count:318304 ave_time:2022617 total_time:643807267441
    REQUEST_NODE_INFO                       ( 2007) count:67568  ave_time:38776  total_time:2620081867
    REQUEST_PING                            ( 1008) count:55100  ave_time:293    total_time:16159038
    REQUEST_JOB_STEP_CREATE                 ( 5001) count:37243  ave_time:18476  total_time:688122372
    REQUEST_JOB_PACK_ALLOC_INFO             ( 4027) count:36398  ave_time:36577  total_time:1331341719
    REQUEST_KILL_JOB                        ( 5032) count:7821   ave_time:50206  total_time:392666200
    REQUEST_CANCEL_JOB_STEP                 ( 5005) count:2342   ave_time:12425  total_time:29100941
    REQUEST_BUILD_INFO                      ( 2001) count:574    ave_time:28934  total_time:16608216
    REQUEST_JOB_NOTIFY                      ( 4022) count:547    ave_time:17239  total_time:9429961
    ACCOUNTING_UPDATE_MSG                   (10001) count:457    ave_time:7636182 total_time:3489735519
    REQUEST_NODE_INFO_SINGLE                ( 2040) count:75     ave_time:281    total_time:21132
    REQUEST_UPDATE_JOB                      ( 3001) count:21     ave_time:98102  total_time:2060154
    REQUEST_UPDATE_PARTITION                ( 3005) count:11     ave_time:748    total_time:8231
    REQUEST_RESOURCE_ALLOCATION             ( 4001) count:11     ave_time:470757 total_time:5178336
    REQUEST_BATCH_SCRIPT                    ( 2051) count:10     ave_time:73539  total_time:735394
    REQUEST_JOB_READY                       ( 4019) count:9      ave_time:305    total_time:2753
    REQUEST_RESERVATION_INFO                ( 2024) count:9      ave_time:324    total_time:2918
    REQUEST_UPDATE_NODE                     ( 3002) count:8      ave_time:90478  total_time:723827
    REQUEST_COMPLETE_JOB_ALLOCATION         ( 5017) count:4      ave_time:1225   total_time:4901
    REQUEST_SHARE_INFO                      ( 2022) count:4      ave_time:45136  total_time:180547
    REQUEST_CREATE_RESERVATION              ( 3006) count:3      ave_time:3740   total_time:11220
    REQUEST_UPDATE_RESERVATION              ( 3009) count:3      ave_time:4174   total_time:12523
    REQUEST_DELETE_RESERVATION              ( 3008) count:3      ave_time:507    total_time:1523
    REQUEST_JOB_WILL_RUN                    ( 4012) count:2      ave_time:1337100 total_time:2674200
    REQUEST_JOB_STEP_INFO                   ( 2005) count:1      ave_time:374    total_time:374
    REQUEST_STATS_INFO                      ( 2035) count:1      ave_time:280    total_time:280
    REQUEST_TRIGGER_PULL                    ( 2030) count:1      ave_time:625    total_time:625

Remote Procedure Call statistics by user
    root            (       0) count:2453662 ave_time:215740 total_time:529354938803
    covenant07      (   15246) count:2059328 ave_time:112083 total_time:230816339235
    slurm           (     280) count:1976634 ave_time:25383  total_time:50174519509
    zshang          (   17246) count:182449 ave_time:6807   total_time:1241946262
    shaowen.mao     (   19650) count:162364 ave_time:18302  total_time:2971624851
    cpattison       (   17344) count:101999 ave_time:19912  total_time:2031065227
    parisamir       (   19240) count:95721  ave_time:292960 total_time:28042501476
    francis         (    3112) count:74287  ave_time:19762  total_time:1468067338
    djimenez        (   17823) count:37358  ave_time:266408 total_time:9952490492
    yangyxjtu       (   18539) count:28396  ave_time:8983   total_time:255107082
    rbovio          (   18281) count:22860  ave_time:8506720 total_time:194463621227
    jmq811          (   16240) count:21898  ave_time:70610  total_time:1546234039
    stewart1983     (   15971) count:15691  ave_time:45062  total_time:707071202
    anjanatalapatra (   14172) count:10720  ave_time:65625  total_time:703504099
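
To rank the entries in the two RPC tables above by the total time they consume, the rows can be parsed and sorted. This is a rough sketch in Python, assuming the sdiag line layout shown above (name, numeric id in parentheses, then count, ave_time and total_time); parse_rpc_stats, top_by_total_time and SAMPLE_SDIAG are illustrative names, not part of Slurm.

```python
import re

# One row of either RPC table: a name, an id in parentheses, then the
# count / ave_time / total_time fields (times as reported by sdiag).
RPC_RE = re.compile(
    r"^\s*(?P<name>\S+)\s+\(\s*(?P<id>\d+)\)\s+"
    r"count:(?P<count>\d+)\s+ave_time:(?P<ave>\d+)\s+total_time:(?P<total>\d+)"
)


def parse_rpc_stats(text):
    """Parse sdiag RPC rows into dicts; lines that do not match are skipped."""
    rows = []
    for line in text.splitlines():
        match = RPC_RE.match(line)
        if match:
            rows.append({
                "name": match.group("name"),
                "id": int(match.group("id")),
                "count": int(match.group("count")),
                "ave": int(match.group("ave")),
                "total": int(match.group("total")),
            })
    return rows


def top_by_total_time(rows, n=3):
    """Return the n rows with the largest total_time, largest first."""
    return sorted(rows, key=lambda row: row["total"], reverse=True)[:n]


# Four rows copied from the message-type table above (hypothetical variable name).
SAMPLE_SDIAG = """\
    REQUEST_JOB_INFO                        ( 2003) count:1826319 ave_time:25114  total_time:45866876387
    REQUEST_COMPLETE_PROLOG                 ( 6018) count:406364 ave_time:438771 total_time:178301089558
    REQUEST_SUBMIT_BATCH_JOB                ( 4003) count:318304 ave_time:2022617 total_time:643807267441
    REQUEST_PING                            ( 1008) count:55100  ave_time:293    total_time:16159038
"""

if __name__ == "__main__":
    for row in top_by_total_time(parse_rpc_stats(SAMPLE_SDIAG)):
        print(f'{row["name"]}: count={row["count"]} ave={row["ave"]} total={row["total"]}')
```

The same parser also accepts the per-user rows, since they use the same layout. In the message-type table, REQUEST_SUBMIT_BATCH_JOB has by far the largest total_time, with an ave_time of 2022617 (about 2 seconds if the values are microseconds), which lines up with the slow job_submit_plugin_submit calls in the log.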

Thanks.


Yang Liu, Ph.D.
Associate Research Scientist
High Performance Research Computing
Division of Research
Texas A&M University