[slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2
Buckley, Ronan
Ronan.Buckley at Dell.com
Tue Jun 25 10:28:45 UTC 2019
Is there a way to diagnose whether I/O to the /cm/shared/apps/slurm/var/cm/statesave directory (used for saving job state) on the NFS storage is the cause of the socket errors?
What values/thresholds from the nfsiostat command would signal that the NFS storage is the bottleneck?
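For reference, I can gather per-mount statistics with something like the following (the 5-second interval is only an example, and I am assuming /cm/shared is the actual mount point); I am just not sure which columns matter:

    # sample the NFS mount backing StateSaveLocation every 5 seconds
    nfsiostat 5 /cm/shared

Presumably the retrans count and the avg RTT / avg exe columns are the ones to watch while the socket timeouts occur, but I would appreciate confirmation.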
From: Buckley, Ronan
Sent: Tuesday, June 25, 2019 11:21 AM
To: Slurm User Community List; slurm-users-bounces at lists.schedmd.com
Subject: RE: [slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2
Hi,
I can reproduce the problem by submitting a job array of 700+ tasks.
The slurmctld log file is also regularly outputting:
[2019-06-25T11:35:31.159] sched: 157 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2019-06-25T11:35:43.007] sched: 193 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2019-06-25T11:36:56.517] backfill: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2019-06-25T11:37:29.620] backfill: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2019-06-25T11:37:45.429] sched: 161 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2019-06-25T11:38:00.472] backfill: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
The max_rpc_cnt is currently set to its default of zero.
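If it helps, my understanding is that this is set through SchedulerParameters in slurm.conf, along these lines (the value 150 is only illustrative, not a recommendation):

    # append max_rpc_cnt to any existing SchedulerParameters options;
    # scheduling cycles are deferred while the RPC backlog is above this count
    SchedulerParameters=max_rpc_cnt=150

followed by an scontrol reconfigure, as far as I can tell.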
Rgds
Ronan
From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Marcelo Garcia
Sent: Tuesday, June 25, 2019 10:35 AM
To: Slurm User Community List
Subject: Re: [slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2
Hi
This seems to be the problem we discussed a few days ago:
https://lists.schedmd.com/pipermail/slurm-users/2019-June/003524.html
But in that thread I think we were using Slurm with workflow managers. It is interesting that you see the problem after adding the second server and moving to an NFS share. Does the problem happen randomly, or does it always happen on your jobs?
I tried to get an idea of how many RPCs would be OK, but got no reply:
https://lists.schedmd.com/pipermail/slurm-users/2019-June/003534.html
My take is that there is no single answer to that question; each site is different.
Best Regards
mg.
From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of Buckley, Ronan
Sent: Tuesday, 25 June 2019 11:17
To: 'slurm-users at lists.schedmd.com' <slurm-users at lists.schedmd.com>; slurm-users-bounces at lists.schedmd.com
Subject: [slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2
Hi,
Since configuring a backup Slurm controller (including moving the StateSaveLocation from a local disk to an NFS share; a sketch of the configuration shape follows the errors below), we are seeing these errors in the slurmctld logs on a regular basis:
Socket timed out on send/recv operation
It sometimes occurs when a job array is started, and squeue then displays the error:
slurm_load_jobs error: Socket timed out on send/recv operation
We also see the following errors:
slurm_load_jobs error: Zero Bytes were transmitted or received
srun: error: Unable to allocate resources: Zero Bytes were transmitted or received
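For context, the controller/state-save part of the configuration now has roughly the following shape (the hostnames are placeholders, not our real ones):

    # primary and backup controller sharing state over the NFS mount
    ControlMachine=ctl-primary
    BackupController=ctl-backup
    # must be on storage visible to both controllers
    StateSaveLocation=/cm/shared/apps/slurm/var/cm/statesave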
sdiag output is below. Does it show an abnormal number of RPC calls by users? Are the REQUEST_JOB_INFO and REQUEST_NODE_INFO counts unusually high?
Server thread count: 3
Agent queue size: 0
Jobs submitted: 14279
Jobs started: 7709
Jobs completed: 7001
Jobs canceled: 38
Jobs failed: 0
Main schedule statistics (microseconds):
Last cycle: 788
Max cycle: 461780
Total cycles: 3319
Mean cycle: 7589
Mean depth cycle: 3
Cycles per minute: 4
Last queue length: 13
Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
Total backfilled jobs (since last slurm start): 3204
Total backfilled jobs (since last stats cycle start): 3160
Total cycles: 436
Last cycle when: Mon Jun 24 15:32:31 2019
Last cycle: 253698
Max cycle: 12701861
Mean cycle: 338674
Last depth cycle: 3
Last depth cycle (try sched): 3
Depth Mean: 15
Depth Mean (try depth): 15
Last queue length: 13
Queue length mean: 3
Remote Procedure Call statistics by message type
REQUEST_PARTITION_INFO ( 2009) count:468871 ave_time:2188 total_time:1026211593
REQUEST_NODE_INFO_SINGLE ( 2040) count:421773 ave_time:1775 total_time:748837928
REQUEST_JOB_INFO ( 2003) count:46877 ave_time:696 total_time:32627442
REQUEST_NODE_INFO ( 2007) count:43575 ave_time:1269 total_time:55301255
REQUEST_JOB_STEP_INFO ( 2005) count:38703 ave_time:201 total_time:7805655
MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:29155 ave_time:758 total_time:22118507
REQUEST_JOB_USER_INFO ( 2039) count:22401 ave_time:391 total_time:8763503
MESSAGE_EPILOG_COMPLETE ( 6012) count:7484 ave_time:6164 total_time:46132632
REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:7064 ave_time:79129 total_time:558971262
REQUEST_PING ( 1008) count:3561 ave_time:141 total_time:502289
REQUEST_STATS_INFO ( 2035) count:3236 ave_time:568 total_time:1838784
REQUEST_BUILD_INFO ( 2001) count:2598 ave_time:7869 total_time:20445066
REQUEST_SUBMIT_BATCH_JOB ( 4003) count:581 ave_time:132730 total_time:77116427
REQUEST_STEP_COMPLETE ( 5016) count:408 ave_time:4373 total_time:1784564
REQUEST_JOB_STEP_CREATE ( 5001) count:326 ave_time:14832 total_time:4835389
REQUEST_JOB_ALLOCATION_INFO_LITE ( 4016) count:302 ave_time:15754 total_time:4757813
REQUEST_JOB_READY ( 4019) count:78 ave_time:1615 total_time:125980
REQUEST_JOB_INFO_SINGLE ( 2021) count:48 ave_time:7851 total_time:376856
REQUEST_KILL_JOB ( 5032) count:38 ave_time:245 total_time:9346
REQUEST_RESOURCE_ALLOCATION ( 4001) count:28 ave_time:12730 total_time:356466
REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:28 ave_time:20504 total_time:574137
REQUEST_CANCEL_JOB_STEP ( 5005) count:7 ave_time:43665 total_time:305661
Remote Procedure Call statistics by user
xxxxx ( 0) count:979383 ave_time:2500 total_time:2449350389
xxxxx ( 11160) count:116109 ave_time:695 total_time:80710478
xxxxx ( 11427) count:1264 ave_time:67572 total_time:85411027
xxxxx ( 11426) count:149 ave_time:7361 total_time:1096874
xxxxx ( 12818) count:136 ave_time:11354 total_time:1544190
xxxxx ( 12475) count:37 ave_time:4985 total_time:184452
xxxxx ( 12487) count:36 ave_time:30318 total_time:1091483
xxxxx ( 11147) count:12 ave_time:33489 total_time:401874
xxxxx ( 11345) count:6 ave_time:584 total_time:3508
xxxxx ( 12876) count:6 ave_time:483 total_time:2900
xxxxx ( 11457) count:4 ave_time:345 total_time:1380
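In case it is useful, I can take a cleaner sample around one of the problematic array submissions with something like the following (the array size, script name and sleep length are placeholders, and sdiag --reset needs operator/admin rights as far as I know):

    sdiag --reset                  # zero the scheduler and RPC counters
    sbatch --array=1-700 job.sh    # hypothetical reproducer submission
    sleep 300
    sdiag                          # counters now cover just this window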
Any suggestions/tips would be appreciated.
Rgds