[slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2

Buckley, Ronan Ronan.Buckley at Dell.com
Mon Jun 24 14:06:47 UTC 2019


Hi,

Since configuring a backup Slurm controller (which included moving the StateSaveLocation from a local disk to an NFS share), we have been seeing these errors in the slurmctld logs on a regular basis:

Socket timed out on send/recv operation
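For reference, the relevant part of our configuration looks roughly like the sketch below (the hostnames and the NFS path are placeholders, not our real values):

    # slurm.conf (sketch, 17.02-era syntax; ctl1/ctl2 and the path are placeholders)
    ControlMachine=ctl1
    BackupController=ctl2
    # Both controllers must see the same state directory, hence the NFS share:
    StateSaveLocation=/nfs/slurm/state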

The timeout sometimes occurs when a job array is started, at which point squeue displays:

slurm_load_jobs error: Socket timed out on send/recv operation

We also see the following errors:

slurm_load_jobs error: Zero Bytes were transmitted or received
srun: error: Unable to allocate resources: Zero Bytes were transmitted or received
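As far as we understand, these client-side timeouts are governed by MessageTimeout in slurm.conf (default 10 seconds), so a slurmctld that is slow to respond, e.g. while blocked writing state to NFS, can trip them. A quick way to check the value currently in effect (a sketch, using standard scontrol output):

    # Show the RPC timeout currently in effect (default: 10 seconds):
    scontrol show config | grep -i MessageTimeout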

The sdiag output is below. Does it show an abnormal number of RPC calls from users? Do the REQUEST_JOB_INFO and REQUEST_NODE_INFO counts look unusually high?

Server thread count: 3
Agent queue size:    0

Jobs submitted: 14279
Jobs started:   7709
Jobs completed: 7001
Jobs canceled:  38
Jobs failed:    0

Main schedule statistics (microseconds):
        Last cycle:   788
        Max cycle:    461780
        Total cycles: 3319
        Mean cycle:   7589
        Mean depth cycle:  3
        Cycles per minute: 4
        Last queue length: 13

Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
        Total backfilled jobs (since last slurm start): 3204
        Total backfilled jobs (since last stats cycle start): 3160
        Total cycles: 436
        Last cycle when: Mon Jun 24 15:32:31 2019
        Last cycle: 253698
        Max cycle:  12701861
        Mean cycle: 338674
        Last depth cycle: 3
        Last depth cycle (try sched): 3
        Depth Mean: 15
        Depth Mean (try depth): 15
        Last queue length: 13
        Queue length mean: 3

Remote Procedure Call statistics by message type
        REQUEST_PARTITION_INFO                  ( 2009) count:468871 ave_time:2188   total_time:1026211593
        REQUEST_NODE_INFO_SINGLE                ( 2040) count:421773 ave_time:1775   total_time:748837928
        REQUEST_JOB_INFO                        ( 2003) count:46877  ave_time:696    total_time:32627442
        REQUEST_NODE_INFO                       ( 2007) count:43575  ave_time:1269   total_time:55301255
        REQUEST_JOB_STEP_INFO                   ( 2005) count:38703  ave_time:201    total_time:7805655
        MESSAGE_NODE_REGISTRATION_STATUS        ( 1002) count:29155  ave_time:758    total_time:22118507
        REQUEST_JOB_USER_INFO                   ( 2039) count:22401  ave_time:391    total_time:8763503
        MESSAGE_EPILOG_COMPLETE                 ( 6012) count:7484   ave_time:6164   total_time:46132632
        REQUEST_COMPLETE_BATCH_SCRIPT           ( 5018) count:7064   ave_time:79129  total_time:558971262
        REQUEST_PING                            ( 1008) count:3561   ave_time:141    total_time:502289
        REQUEST_STATS_INFO                      ( 2035) count:3236   ave_time:568    total_time:1838784
        REQUEST_BUILD_INFO                      ( 2001) count:2598   ave_time:7869   total_time:20445066
        REQUEST_SUBMIT_BATCH_JOB                ( 4003) count:581    ave_time:132730 total_time:77116427
        REQUEST_STEP_COMPLETE                   ( 5016) count:408    ave_time:4373   total_time:1784564
        REQUEST_JOB_STEP_CREATE                 ( 5001) count:326    ave_time:14832  total_time:4835389
        REQUEST_JOB_ALLOCATION_INFO_LITE        ( 4016) count:302    ave_time:15754  total_time:4757813
        REQUEST_JOB_READY                       ( 4019) count:78     ave_time:1615   total_time:125980
        REQUEST_JOB_INFO_SINGLE                 ( 2021) count:48     ave_time:7851   total_time:376856
        REQUEST_KILL_JOB                        ( 5032) count:38     ave_time:245    total_time:9346
        REQUEST_RESOURCE_ALLOCATION             ( 4001) count:28     ave_time:12730  total_time:356466
        REQUEST_COMPLETE_JOB_ALLOCATION         ( 5017) count:28     ave_time:20504  total_time:574137
        REQUEST_CANCEL_JOB_STEP                 ( 5005) count:7      ave_time:43665  total_time:305661

Remote Procedure Call statistics by user
        xxxxx           (       0) count:979383 ave_time:2500   total_time:2449350389
        xxxxx           (   11160) count:116109 ave_time:695    total_time:80710478
        xxxxx           (   11427) count:1264   ave_time:67572  total_time:85411027
        xxxxx           (   11426) count:149    ave_time:7361   total_time:1096874
        xxxxx           (   12818) count:136    ave_time:11354  total_time:1544190
        xxxxx           (   12475) count:37     ave_time:4985   total_time:184452
        xxxxx           (   12487) count:36     ave_time:30318  total_time:1091483
        xxxxx           (   11147) count:12     ave_time:33489  total_time:401874
        xxxxx           (   11345) count:6      ave_time:584    total_time:3508
        xxxxx           (   12876) count:6      ave_time:483    total_time:2900
        xxxxx           (   11457) count:4      ave_time:345    total_time:1380
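For what it is worth, the RPC counters above are cumulative since the last reset, so the absolute counts say little about the current rate. One way to sample the rate instead (a sketch; sdiag --reset requires SlurmUser or root privileges):

    # Reset the counters, wait a minute, then re-sample the busiest RPC types:
    sdiag --reset
    sleep 60
    sdiag | grep -E 'REQUEST_(JOB|NODE|PARTITION)_INFO'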

Any suggestions or tips would be appreciated.

Regards