[slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2
Buckley, Ronan
Ronan.Buckley at Dell.com
Mon Jun 24 14:06:47 UTC 2019
Hi,
Since configuring a backup Slurm controller (which included moving the StateSaveLocation from a local disk to an NFS share), we have been seeing the following error in the slurmctld logs on a regular basis:
Socket timed out on send/recv operation
It sometimes occurs when a job array starts, and squeue then displays the error:
slurm_load_jobs error: Socket timed out on send/recv operation
We also see the following errors:
slurm_load_jobs error: Zero Bytes were transmitted or received
srun: error: Unable to allocate resources: Zero Bytes were transmitted or received
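
For context, the relevant slurm.conf settings are along these lines (the hostnames and NFS path below are placeholders, not our real values):

ControlMachine=ctrl1
BackupController=ctrl2
StateSaveLocation=/nfs/slurm/state   # moved here from local disk so both controllers can reach it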
The sdiag output is below. Does it show an abnormal number of RPC calls from users? Do the REQUEST_JOB_INFO and REQUEST_NODE_INFO counts look excessively high?
Server thread count: 3
Agent queue size: 0
Jobs submitted: 14279
Jobs started: 7709
Jobs completed: 7001
Jobs canceled: 38
Jobs failed: 0
Main schedule statistics (microseconds):
Last cycle: 788
Max cycle: 461780
Total cycles: 3319
Mean cycle: 7589
Mean depth cycle: 3
Cycles per minute: 4
Last queue length: 13
Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
Total backfilled jobs (since last slurm start): 3204
Total backfilled jobs (since last stats cycle start): 3160
Total cycles: 436
Last cycle when: Mon Jun 24 15:32:31 2019
Last cycle: 253698
Max cycle: 12701861
Mean cycle: 338674
Last depth cycle: 3
Last depth cycle (try sched): 3
Depth Mean: 15
Depth Mean (try depth): 15
Last queue length: 13
Queue length mean: 3
Remote Procedure Call statistics by message type
REQUEST_PARTITION_INFO ( 2009) count:468871 ave_time:2188 total_time:1026211593
REQUEST_NODE_INFO_SINGLE ( 2040) count:421773 ave_time:1775 total_time:748837928
REQUEST_JOB_INFO ( 2003) count:46877 ave_time:696 total_time:32627442
REQUEST_NODE_INFO ( 2007) count:43575 ave_time:1269 total_time:55301255
REQUEST_JOB_STEP_INFO ( 2005) count:38703 ave_time:201 total_time:7805655
MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:29155 ave_time:758 total_time:22118507
REQUEST_JOB_USER_INFO ( 2039) count:22401 ave_time:391 total_time:8763503
MESSAGE_EPILOG_COMPLETE ( 6012) count:7484 ave_time:6164 total_time:46132632
REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:7064 ave_time:79129 total_time:558971262
REQUEST_PING ( 1008) count:3561 ave_time:141 total_time:502289
REQUEST_STATS_INFO ( 2035) count:3236 ave_time:568 total_time:1838784
REQUEST_BUILD_INFO ( 2001) count:2598 ave_time:7869 total_time:20445066
REQUEST_SUBMIT_BATCH_JOB ( 4003) count:581 ave_time:132730 total_time:77116427
REQUEST_STEP_COMPLETE ( 5016) count:408 ave_time:4373 total_time:1784564
REQUEST_JOB_STEP_CREATE ( 5001) count:326 ave_time:14832 total_time:4835389
REQUEST_JOB_ALLOCATION_INFO_LITE ( 4016) count:302 ave_time:15754 total_time:4757813
REQUEST_JOB_READY ( 4019) count:78 ave_time:1615 total_time:125980
REQUEST_JOB_INFO_SINGLE ( 2021) count:48 ave_time:7851 total_time:376856
REQUEST_KILL_JOB ( 5032) count:38 ave_time:245 total_time:9346
REQUEST_RESOURCE_ALLOCATION ( 4001) count:28 ave_time:12730 total_time:356466
REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:28 ave_time:20504 total_time:574137
REQUEST_CANCEL_JOB_STEP ( 5005) count:7 ave_time:43665 total_time:305661
Remote Procedure Call statistics by user
xxxxx ( 0) count:979383 ave_time:2500 total_time:2449350389
xxxxx ( 11160) count:116109 ave_time:695 total_time:80710478
xxxxx ( 11427) count:1264 ave_time:67572 total_time:85411027
xxxxx ( 11426) count:149 ave_time:7361 total_time:1096874
xxxxx ( 12818) count:136 ave_time:11354 total_time:1544190
xxxxx ( 12475) count:37 ave_time:4985 total_time:184452
xxxxx ( 12487) count:36 ave_time:30318 total_time:1091483
xxxxx ( 11147) count:12 ave_time:33489 total_time:401874
xxxxx ( 11345) count:6 ave_time:584 total_time:3508
xxxxx ( 12876) count:6 ave_time:483 total_time:2900
xxxxx ( 11457) count:4 ave_time:345 total_time:1380
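
In case it helps to interpret the figures: these counters are cumulative since the last reset, so I can re-sample over a fixed window to get a rate, e.g.:

sdiag --reset   # zero the scheduler/RPC counters (requires SlurmUser or root)
sleep 600       # let statistics accumulate for 10 minutes
sdiag           # the RPC sections now cover just that window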
Any suggestions or tips would be appreciated.
Rgds