[slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2
Eli V
eliventer at gmail.com
Tue Jun 25 13:01:12 UTC 2019
Just FYI, I tried keeping the shared state on NFS once, and it didn't
work well. I switched to a native-client GlusterFS volume shared between
the 2 controller nodes and haven't had a problem with it since.
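
For what it's worth, that setup usually boils down to something like the
following on both controllers (the volume name, host name and mount point
here are illustrative, not taken from an actual config):

    # /etc/fstab -- mount the gluster volume with the native (FUSE) client
    ctrl1:/slurmstate  /var/spool/slurm/statesave  glusterfs  defaults,_netdev  0 0

    # slurm.conf -- point the state save directory at that mount
    StateSaveLocation=/var/spool/slurm/statesave
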
On Tue, Jun 25, 2019 at 6:32 AM Buckley, Ronan <Ronan.Buckley at dell.com> wrote:
>
> Is there a way to diagnose whether I/O to the /cm/shared/apps/slurm/var/cm/statesave directory (used for job state) on the NFS storage is the cause of the socket errors?
>
> What values/thresholds from the nfsiostat output would signal the NFS storage as the bottleneck?
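>
> One way to sample this while the errors occur (the interval and count are
> just examples, and this assumes /cm/shared is the NFS mount point):
>
>     # sample the NFS mount every 5 seconds, 12 times
>     nfsiostat 5 12 /cm/shared
>
> Sustained growth in the "avg RTT (ms)" / "avg exe (ms)" columns, or any
> non-zero "retrans" counts while the timeouts appear, would point at the
> NFS side.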
>
>
>
> From: Buckley, Ronan
> Sent: Tuesday, June 25, 2019 11:21 AM
> To: Slurm User Community List; slurm-users-bounces at lists.schedmd.com
> Subject: RE: [slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2
>
>
>
> Hi,
>
>
>
> I can reproduce the problem by submitting a job array of 700+ tasks.
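>
> A minimal reproducer would be something along these lines (the wrapped
> command is just a placeholder):
>
>     sbatch --array=0-699 --wrap='sleep 300'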
>
> The slurmctld log file is also regularly outputting:
>
>
>
> [2019-06-25T11:35:31.159] sched: 157 pending RPCs at cycle end, consider configuring max_rpc_cnt
>
> [2019-06-25T11:35:43.007] sched: 193 pending RPCs at cycle end, consider configuring max_rpc_cnt
>
> [2019-06-25T11:36:56.517] backfill: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
>
> [2019-06-25T11:37:29.620] backfill: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
>
> [2019-06-25T11:37:45.429] sched: 161 pending RPCs at cycle end, consider configuring max_rpc_cnt
>
> [2019-06-25T11:38:00.472] backfill: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
>
>
>
> The max_rpc_cnt is currently set to its default of zero.
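>
> For reference, max_rpc_cnt would be set via SchedulerParameters in
> slurm.conf (the value below is only an example, not something we have
> tried, and it would be appended to any existing SchedulerParameters
> entry):
>
>     # slurm.conf -- have the schedulers defer while more than 150 RPCs are queued
>     SchedulerParameters=max_rpc_cnt=150
>
> followed by an "scontrol reconfigure" to pick it up.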
>
>
>
> Rgds
>
>
>
> Ronan
>
>
>
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Marcelo Garcia
> Sent: Tuesday, June 25, 2019 10:35 AM
> To: Slurm User Community List
> Subject: Re: [slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2
>
>
>
>
> Hi
>
>
>
> This looks like a problem we discussed a few days ago:
>
> https://lists.schedmd.com/pipermail/slurm-users/2019-June/003524.html
>
> But in that thread I think we were using Slurm with workflow managers. It's interesting that you see the problem after adding the second server and the NFS share. Do you hit this problem randomly, or does it always happen with your jobs?
>
>
>
> I tried to get an idea of how many RPCs would be OK, but got no reply:
>
> https://lists.schedmd.com/pipermail/slurm-users/2019-June/003534.html
>
> My take is that there is no single answer to that question; each site is different.
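>
> What you can do per site is look at the rate rather than the absolute
> counts, for example something like this (sdiag --reset needs to be run
> as SlurmUser or root; the interval is arbitrary):
>
>     sdiag --reset                          # zero the RPC counters
>     sleep 600                              # let the cluster run for 10 minutes
>     sdiag | grep -A 25 'by message type'   # see what accumulated
>
> and then divide the counts by the interval to get RPCs per second.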
>
>
>
> Best Regards
>
>
>
> mg.
>
>
>
> From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of Buckley, Ronan
> Sent: Dienstag, 25. Juni 2019 11:17
> To: 'slurm-users at lists.schedmd.com' <slurm-users at lists.schedmd.com>; slurm-users-bounces at lists.schedmd.com
> Subject: [slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2
>
>
>
> Hi,
>
>
>
> Since configuring a backup Slurm controller (including moving the StateSaveLocation from a local disk to an NFS share; the relevant slurm.conf lines are sketched below, after the error examples), we are seeing these errors in the slurmctld logs on a regular basis:
>
>
>
> Socket timed out on send/recv operation
>
>
>
> It sometimes occurs when a job array is started, and squeue then displays the error:
>
>
>
> slurm_load_jobs error: Socket timed out on send/recv operation
>
>
>
> We also see the following errors:
>
>
>
> slurm_load_jobs error: Zero Bytes were transmitted or received
>
> srun: error: Unable to allocate resources: Zero Bytes were transmitted or received
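>
> As mentioned above, the controller-related configuration boils down to
> slurm.conf entries along these lines (the host names here are
> placeholders; the path is the statesave directory on the NFS share):
>
>     ControlMachine=ctrl1
>     BackupController=ctrl2
>     # state directory on the NFS share, visible to both controllers
>     StateSaveLocation=/cm/shared/apps/slurm/var/cm/statesave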
>
>
>
> The sdiag output is below. Does it show an abnormal number of RPC calls by users? Are the REQUEST_JOB_INFO and REQUEST_NODE_INFO counts unusually high?
>
>
>
> Server thread count: 3
>
> Agent queue size: 0
>
>
>
> Jobs submitted: 14279
>
> Jobs started: 7709
>
> Jobs completed: 7001
>
> Jobs canceled: 38
>
> Jobs failed: 0
>
>
>
> Main schedule statistics (microseconds):
>
> Last cycle: 788
>
> Max cycle: 461780
>
> Total cycles: 3319
>
> Mean cycle: 7589
>
> Mean depth cycle: 3
>
> Cycles per minute: 4
>
> Last queue length: 13
>
>
>
> Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
>
> Total backfilled jobs (since last slurm start): 3204
>
> Total backfilled jobs (since last stats cycle start): 3160
>
> Total cycles: 436
>
> Last cycle when: Mon Jun 24 15:32:31 2019
>
> Last cycle: 253698
>
> Max cycle: 12701861
>
> Mean cycle: 338674
>
> Last depth cycle: 3
>
> Last depth cycle (try sched): 3
>
> Depth Mean: 15
>
> Depth Mean (try depth): 15
>
> Last queue length: 13
>
> Queue length mean: 3
>
>
>
> Remote Procedure Call statistics by message type
>
> REQUEST_PARTITION_INFO ( 2009) count:468871 ave_time:2188 total_time:1026211593
>
> REQUEST_NODE_INFO_SINGLE ( 2040) count:421773 ave_time:1775 total_time:748837928
>
> REQUEST_JOB_INFO ( 2003) count:46877 ave_time:696 total_time:32627442
>
> REQUEST_NODE_INFO ( 2007) count:43575 ave_time:1269 total_time:55301255
>
> REQUEST_JOB_STEP_INFO ( 2005) count:38703 ave_time:201 total_time:7805655
>
> MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:29155 ave_time:758 total_time:22118507
>
> REQUEST_JOB_USER_INFO ( 2039) count:22401 ave_time:391 total_time:8763503
>
> MESSAGE_EPILOG_COMPLETE ( 6012) count:7484 ave_time:6164 total_time:46132632
>
> REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:7064 ave_time:79129 total_time:558971262
>
> REQUEST_PING ( 1008) count:3561 ave_time:141 total_time:502289
>
> REQUEST_STATS_INFO ( 2035) count:3236 ave_time:568 total_time:1838784
>
> REQUEST_BUILD_INFO ( 2001) count:2598 ave_time:7869 total_time:20445066
>
> REQUEST_SUBMIT_BATCH_JOB ( 4003) count:581 ave_time:132730 total_time:77116427
>
> REQUEST_STEP_COMPLETE ( 5016) count:408 ave_time:4373 total_time:1784564
>
> REQUEST_JOB_STEP_CREATE ( 5001) count:326 ave_time:14832 total_time:4835389
>
> REQUEST_JOB_ALLOCATION_INFO_LITE ( 4016) count:302 ave_time:15754 total_time:4757813
>
> REQUEST_JOB_READY ( 4019) count:78 ave_time:1615 total_time:125980
>
> REQUEST_JOB_INFO_SINGLE ( 2021) count:48 ave_time:7851 total_time:376856
>
> REQUEST_KILL_JOB ( 5032) count:38 ave_time:245 total_time:9346
>
> REQUEST_RESOURCE_ALLOCATION ( 4001) count:28 ave_time:12730 total_time:356466
>
> REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:28 ave_time:20504 total_time:574137
>
> REQUEST_CANCEL_JOB_STEP ( 5005) count:7 ave_time:43665 total_time:305661
>
>
>
> Remote Procedure Call statistics by user
>
> xxxxx ( 0) count:979383 ave_time:2500 total_time:2449350389
>
> xxxxx ( 11160) count:116109 ave_time:695 total_time:80710478
>
> xxxxx ( 11427) count:1264 ave_time:67572 total_time:85411027
>
> xxxxx ( 11426) count:149 ave_time:7361 total_time:1096874
>
> xxxxx ( 12818) count:136 ave_time:11354 total_time:1544190
>
> xxxxx ( 12475) count:37 ave_time:4985 total_time:184452
>
> xxxxx ( 12487) count:36 ave_time:30318 total_time:1091483
>
> xxxxx ( 11147) count:12 ave_time:33489 total_time:401874
>
> xxxxx ( 11345) count:6 ave_time:584 total_time:3508
>
> xxxxx ( 12876) count:6 ave_time:483 total_time:2900
>
> xxxxx ( 11457) count:4 ave_time:345 total_time:1380
>
>
>
> Any suggestions/tips would be appreciated.
>
> Rgds
>
>
>