[slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2
Eli V
eliventer at gmail.com
Tue Jun 25 13:01:12 UTC 2019
Just FYI, I tried keeping the shared state on NFS once, and it didn't
work well. I switched to a native-client GlusterFS volume shared between
the 2 controller nodes and haven't had a problem with it since.
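
For what it's worth, that setup usually boils down to something like the
following on both controllers (the volume name, host name and mount point
here are illustrative, not taken from an actual config):

    # /etc/fstab -- mount the gluster volume with the native (FUSE) client
    ctrl1:/slurmstate  /var/spool/slurm/statesave  glusterfs  defaults,_netdev  0 0

    # slurm.conf -- point the state save directory at that mount
    StateSaveLocation=/var/spool/slurm/statesave
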
On Tue, Jun 25, 2019 at 6:32 AM Buckley, Ronan <Ronan.Buckley at dell.com> wrote:
>
> Is there a way to diagnose whether I/O to the /cm/shared/apps/slurm/var/cm/statesave directory (used for job state) on the NFS storage is the cause of the socket errors?
>
> What values/thresholds from the nfsiostat output would signal the NFS storage as the bottleneck?
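>
> One way to sample this while the errors occur (the interval and count are
> just examples, and this assumes /cm/shared is the NFS mount point):
>
>     # sample the NFS mount every 5 seconds, 12 times
>     nfsiostat 5 12 /cm/shared
>
> Sustained growth in the "avg RTT (ms)" / "avg exe (ms)" columns, or any
> non-zero "retrans" counts while the timeouts appear, would point at the
> NFS side.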
>
>
>
> From: Buckley, Ronan
> Sent: Tuesday, June 25, 2019 11:21 AM
> To: Slurm User Community List; slurm-users-bounces at lists.schedmd.com
> Subject: RE: [slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2
>
>
>
> Hi,
>
>
>
> I can reproduce the problem by submitting a job array of 700+ tasks.
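>
> A minimal reproducer would be something along these lines (the wrapped
> command is just a placeholder):
>
>     sbatch --array=0-699 --wrap='sleep 300'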
>
> The slurmctld log file is also regularly outputting:
>
>
>
> [2019-06-25T11:35:31.159] sched: 157 pending RPCs at cycle end, consider configuring max_rpc_cnt
>
> [2019-06-25T11:35:43.007] sched: 193 pending RPCs at cycle end, consider configuring max_rpc_cnt
>
> [2019-06-25T11:36:56.517] backfill: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
>
> [2019-06-25T11:37:29.620] backfill: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
>
> [2019-06-25T11:37:45.429] sched: 161 pending RPCs at cycle end, consider configuring max_rpc_cnt
>
> [2019-06-25T11:38:00.472] backfill: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
>
>
>
> The max_rpc_cnt is currently set to its default of zero.
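>
> For reference, max_rpc_cnt would be set via SchedulerParameters in
> slurm.conf (the value below is only an example, not something we have
> tried, and it would be appended to any existing SchedulerParameters
> entry):
>
>     # slurm.conf -- have the schedulers defer while more than 150 RPCs are queued
>     SchedulerParameters=max_rpc_cnt=150
>
> followed by an "scontrol reconfigure" to pick it up.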
>
>
>
> Rgds
>
>
>
> Ronan
>
>
>
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Marcelo Garcia
> Sent: Tuesday, June 25, 2019 10:35 AM
> To: Slurm User Community List
> Subject: Re: [slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2
>
>
>
>
> Hi
>
>
>
> This looks like a problem we discussed a few days ago:
>
> https://lists.schedmd.com/pipermail/slurm-users/2019-June/003524.html
>
> But in that thread I think we were using Slurm with workflow managers. It's interesting that you see the problem after adding the second server and the NFS share. Do you hit this problem randomly, or does it always happen with your jobs?
>
>
>
> I tried to get an idea of how many RPCs would be OK, but got no reply:
>
> https://lists.schedmd.com/pipermail/slurm-users/2019-June/003534.html
>
> My take is that there is no single answer to that question; each site is different.
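>
> What you can do per site is look at the rate rather than the absolute
> counts, for example something like this (sdiag --reset needs to be run
> as SlurmUser or root; the interval is arbitrary):
>
>     sdiag --reset                          # zero the RPC counters
>     sleep 600                              # let the cluster run for 10 minutes
>     sdiag | grep -A 25 'by message type'   # see what accumulated
>
> and then divide the counts by the interval to get RPCs per second.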
>
>
>
> Best Regards
>
>
>
> mg.
>
>
>
> From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of Buckley, Ronan
> Sent: Dienstag, 25. Juni 2019 11:17
> To: 'slurm-users at lists.schedmd.com' <slurm-users at lists.schedmd.com>; slurm-users-bounces at lists.schedmd.com
> Subject: [slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2
>
>
>
> Hi,
>
>
>
> Since configuring a backup Slurm controller (including moving the StateSaveLocation from a local disk to an NFS share; the relevant slurm.conf lines are sketched below, after the error examples), we are seeing these errors in the slurmctld logs on a regular basis:
>
>
>
> Socket timed out on send/recv operation
>
>
>
> It sometimes occurs when a job array is started, and squeue then displays the error:
>
>
>
> slurm_load_jobs error: Socket timed out on send/recv operation
>
>
>
> We also see the following errors:
>
>
>
> slurm_load_jobs error: Zero Bytes were transmitted or received
>
> srun: error: Unable to allocate resources: Zero Bytes were transmitted or received
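>
> As mentioned above, the controller-related configuration boils down to
> slurm.conf entries along these lines (the host names here are
> placeholders; the path is the statesave directory on the NFS share):
>
>     ControlMachine=ctrl1
>     BackupController=ctrl2
>     # state directory on the NFS share, visible to both controllers
>     StateSaveLocation=/cm/shared/apps/slurm/var/cm/statesave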
>
>
>
> The sdiag output is below. Does it show an abnormal number of RPC calls by users? Are the REQUEST_JOB_INFO and REQUEST_NODE_INFO counts unusually high?
>
>
>
> Server thread count: 3
>
> Agent queue size: 0
>
>
>
> Jobs submitted: 14279
>
> Jobs started: 7709
>
> Jobs completed: 7001
>
> Jobs canceled: 38
>
> Jobs failed: 0
>
>
>
> Main schedule statistics (microseconds):
>
> Last cycle: 788
>
> Max cycle: 461780
>
> Total cycles: 3319
>
> Mean cycle: 7589
>
> Mean depth cycle: 3
>
> Cycles per minute: 4
>
> Last queue length: 13
>
>
>
> Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
>
> Total backfilled jobs (since last slurm start): 3204
>
> Total backfilled jobs (since last stats cycle start): 3160
>
> Total cycles: 436
>
> Last cycle when: Mon Jun 24 15:32:31 2019
>
> Last cycle: 253698
>
> Max cycle: 12701861
>
> Mean cycle: 338674
>
> Last depth cycle: 3
>
> Last depth cycle (try sched): 3
>
> Depth Mean: 15
>
> Depth Mean (try depth): 15
>
> Last queue length: 13
>
> Queue length mean: 3
>
>
>
> Remote Procedure Call statistics by message type
>
> REQUEST_PARTITION_INFO ( 2009) count:468871 ave_time:2188 total_time:1026211593
>
> REQUEST_NODE_INFO_SINGLE ( 2040) count:421773 ave_time:1775 total_time:748837928
>
> REQUEST_JOB_INFO ( 2003) count:46877 ave_time:696 total_time:32627442
>
> REQUEST_NODE_INFO ( 2007) count:43575 ave_time:1269 total_time:55301255
>
> REQUEST_JOB_STEP_INFO ( 2005) count:38703 ave_time:201 total_time:7805655
>
> MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:29155 ave_time:758 total_time:22118507
>
> REQUEST_JOB_USER_INFO ( 2039) count:22401 ave_time:391 total_time:8763503
>
> MESSAGE_EPILOG_COMPLETE ( 6012) count:7484 ave_time:6164 total_time:46132632
>
> REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:7064 ave_time:79129 total_time:558971262
>
> REQUEST_PING ( 1008) count:3561 ave_time:141 total_time:502289
>
> REQUEST_STATS_INFO ( 2035) count:3236 ave_time:568 total_time:1838784
>
> REQUEST_BUILD_INFO ( 2001) count:2598 ave_time:7869 total_time:20445066
>
> REQUEST_SUBMIT_BATCH_JOB ( 4003) count:581 ave_time:132730 total_time:77116427
>
> REQUEST_STEP_COMPLETE ( 5016) count:408 ave_time:4373 total_time:1784564
>
> REQUEST_JOB_STEP_CREATE ( 5001) count:326 ave_time:14832 total_time:4835389
>
> REQUEST_JOB_ALLOCATION_INFO_LITE ( 4016) count:302 ave_time:15754 total_time:4757813
>
> REQUEST_JOB_READY ( 4019) count:78 ave_time:1615 total_time:125980
>
> REQUEST_JOB_INFO_SINGLE ( 2021) count:48 ave_time:7851 total_time:376856
>
> REQUEST_KILL_JOB ( 5032) count:38 ave_time:245 total_time:9346
>
> REQUEST_RESOURCE_ALLOCATION ( 4001) count:28 ave_time:12730 total_time:356466
>
> REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:28 ave_time:20504 total_time:574137
>
> REQUEST_CANCEL_JOB_STEP ( 5005) count:7 ave_time:43665 total_time:305661
>
>
>
> Remote Procedure Call statistics by user
>
> xxxxx ( 0) count:979383 ave_time:2500 total_time:2449350389
>
> xxxxx ( 11160) count:116109 ave_time:695 total_time:80710478
>
> xxxxx ( 11427) count:1264 ave_time:67572 total_time:85411027
>
> xxxxx ( 11426) count:149 ave_time:7361 total_time:1096874
>
> xxxxx ( 12818) count:136 ave_time:11354 total_time:1544190
>
> xxxxx ( 12475) count:37 ave_time:4985 total_time:184452
>
> xxxxx ( 12487) count:36 ave_time:30318 total_time:1091483
>
> xxxxx ( 11147) count:12 ave_time:33489 total_time:401874
>
> xxxxx ( 11345) count:6 ave_time:584 total_time:3508
>
> xxxxx ( 12876) count:6 ave_time:483 total_time:2900
>
> xxxxx ( 11457) count:4 ave_time:345 total_time:1380
>
>
>
> Any suggestions/tips would be appreciated.
>
> Rgds
>
>
>