[slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2

Buckley, Ronan Ronan.Buckley at Dell.com
Tue Jun 25 10:28:45 UTC 2019


Is there a way to diagnose whether I/O to the /cm/shared/apps/slurm/var/cm/statesave directory (used for saving job state) on the NFS storage is the cause of the socket errors?
What values/thresholds from the nfsiostat command would point to the NFS storage as the bottleneck?
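A rough way to check (a sketch, assuming /cm/shared is the actual NFS mount point behind that path; adjust to your setup) is to sample the mount with nfsiostat from nfs-utils while a large job array is running:

        nfsiostat 5 12 /cm/shared     # 12 samples, 5 seconds apart, for that mount only

In the read/write sections, "avg RTT (ms)" is the round trip to the server, "avg exe (ms)" is the total time the client spent on each op, and "retrans" counts retransmissions. I am not aware of a hard threshold; avg exe climbing into the hundreds of milliseconds, a large gap between avg exe and avg RTT, or a non-zero retrans rate while slurmctld is saving state would all point at the NFS side. mountstats /cm/shared (also in nfs-utils) shows the same counters in raw form.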

From: Buckley, Ronan
Sent: Tuesday, June 25, 2019 11:21 AM
To: Slurm User Community List; slurm-users-bounces at lists.schedmd.com
Subject: RE: [slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2

Hi,

I can reproduce the problem by submitting a job array of 700+ tasks.
The slurmctld log file is also regularly outputting:

[2019-06-25T11:35:31.159] sched: 157 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2019-06-25T11:35:43.007] sched: 193 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2019-06-25T11:36:56.517] backfill: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2019-06-25T11:37:29.620] backfill: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2019-06-25T11:37:45.429] sched: 161 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2019-06-25T11:38:00.472] backfill: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt

The max_rpc_cnt is currently set to its default of zero.
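For reference, max_rpc_cnt lives in SchedulerParameters in slurm.conf; a minimal sketch of enabling it (150 here is purely illustrative, and it should be appended to whatever SchedulerParameters values are already set):

        SchedulerParameters=max_rpc_cnt=150

followed by "scontrol reconfigure" on the controller. Once the number of pending RPCs rises above that value, slurmctld should defer scheduling and backfill cycles so it can work through the backlog, which is what the log messages above are hinting at.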

Rgds

Ronan

From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Marcelo Garcia
Sent: Tuesday, June 25, 2019 10:35 AM
To: Slurm User Community List
Subject: Re: [slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2


Hi

It looks like a problem we discussed a few days ago:
https://lists.schedmd.com/pipermail/slurm-users/2019-June/003524.html
But in that thread I think we were using Slurm with workflow managers. It's interesting that you hit the problem after adding the second server and moving to an NFS share. Do you see this problem randomly, or does it always happen with your jobs?

I tried to get an idea of how many RPCs would be OK, but got no reply:
https://lists.schedmd.com/pipermail/slurm-users/2019-June/003534.html
My take is that there is no single answer to that question; each site is different.
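One crude way to get a per-site baseline (assuming you can run privileged commands on the controller) is to reset the counters and sample again over a fixed window:

        sdiag --reset     # clear the RPC counters (needs SlurmUser/root)
        sleep 600
        sdiag

Dividing the per-type counts by the window length at least gives you RPCs per minute for things like REQUEST_JOB_INFO and REQUEST_NODE_INFO on your own system, even if there is no universal "too many" number.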

Best Regards

mg.

From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of Buckley, Ronan
Sent: Dienstag, 25. Juni 2019 11:17
To: 'slurm-users at lists.schedmd.com' <slurm-users at lists.schedmd.com>; slurm-users-bounces at lists.schedmd.com
Subject: [slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2

Hi,

Since configuring a backup Slurm controller (including moving the StateSaveLocation from a local disk to an NFS share), we are regularly seeing these errors in the slurmctld logs:

Socket timed out on send/recv operation

It sometimes occurs when a job array is started, and squeue will display the error:

slurm_load_jobs error: Socket timed out on send/recv operation

We also see the following errors:

slurm_load_jobs error: Zero Bytes were transmitted or received
srun: error: Unable to allocate resources: Zero Bytes were transmitted or received
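For context, the relevant controller settings in slurm.conf look roughly like this (the hostnames below are placeholders; the path is the real one):

        ControlMachine=master01
        BackupController=master02
        StateSaveLocation=/cm/shared/apps/slurm/var/cm/statesave

i.e. both controllers now share the same StateSaveLocation over NFS.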

sdiag output is below. Does it show an abnormal number of RPC calls by the users? Are the REQUEST_JOB_INFO and REQUEST_NODE_INFO counts very high?

Server thread count: 3
Agent queue size:    0

Jobs submitted: 14279
Jobs started:   7709
Jobs completed: 7001
Jobs canceled:  38
Jobs failed:    0

Main schedule statistics (microseconds):
        Last cycle:   788
        Max cycle:    461780
        Total cycles: 3319
        Mean cycle:   7589
        Mean depth cycle:  3
        Cycles per minute: 4
        Last queue length: 13

Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
        Total backfilled jobs (since last slurm start): 3204
        Total backfilled jobs (since last stats cycle start): 3160
        Total cycles: 436
        Last cycle when: Mon Jun 24 15:32:31 2019
        Last cycle: 253698
        Max cycle:  12701861
        Mean cycle: 338674
        Last depth cycle: 3
        Last depth cycle (try sched): 3
        Depth Mean: 15
        Depth Mean (try depth): 15
        Last queue length: 13
        Queue length mean: 3

Remote Procedure Call statistics by message type
        REQUEST_PARTITION_INFO                  ( 2009) count:468871 ave_time:2188   total_time:1026211593
        REQUEST_NODE_INFO_SINGLE                ( 2040) count:421773 ave_time:1775   total_time:748837928
        REQUEST_JOB_INFO                        ( 2003) count:46877  ave_time:696    total_time:32627442
        REQUEST_NODE_INFO                       ( 2007) count:43575  ave_time:1269   total_time:55301255
        REQUEST_JOB_STEP_INFO                   ( 2005) count:38703  ave_time:201    total_time:7805655
        MESSAGE_NODE_REGISTRATION_STATUS        ( 1002) count:29155  ave_time:758    total_time:22118507
        REQUEST_JOB_USER_INFO                   ( 2039) count:22401  ave_time:391    total_time:8763503
        MESSAGE_EPILOG_COMPLETE                 ( 6012) count:7484   ave_time:6164   total_time:46132632
        REQUEST_COMPLETE_BATCH_SCRIPT           ( 5018) count:7064   ave_time:79129  total_time:558971262
        REQUEST_PING                            ( 1008) count:3561   ave_time:141    total_time:502289
        REQUEST_STATS_INFO                      ( 2035) count:3236   ave_time:568    total_time:1838784
        REQUEST_BUILD_INFO                      ( 2001) count:2598   ave_time:7869   total_time:20445066
        REQUEST_SUBMIT_BATCH_JOB                ( 4003) count:581    ave_time:132730 total_time:77116427
        REQUEST_STEP_COMPLETE                   ( 5016) count:408    ave_time:4373   total_time:1784564
        REQUEST_JOB_STEP_CREATE                 ( 5001) count:326    ave_time:14832  total_time:4835389
        REQUEST_JOB_ALLOCATION_INFO_LITE        ( 4016) count:302    ave_time:15754  total_time:4757813
        REQUEST_JOB_READY                       ( 4019) count:78     ave_time:1615   total_time:125980
        REQUEST_JOB_INFO_SINGLE                 ( 2021) count:48     ave_time:7851   total_time:376856
        REQUEST_KILL_JOB                        ( 5032) count:38     ave_time:245    total_time:9346
        REQUEST_RESOURCE_ALLOCATION             ( 4001) count:28     ave_time:12730  total_time:356466
        REQUEST_COMPLETE_JOB_ALLOCATION         ( 5017) count:28     ave_time:20504  total_time:574137
        REQUEST_CANCEL_JOB_STEP                 ( 5005) count:7      ave_time:43665  total_time:305661

Remote Procedure Call statistics by user
        xxxxx           (       0) count:979383 ave_time:2500   total_time:2449350389
        xxxxx           (   11160) count:116109 ave_time:695    total_time:80710478
        xxxxx           (   11427) count:1264   ave_time:67572  total_time:85411027
        xxxxx           (   11426) count:149    ave_time:7361   total_time:1096874
        xxxxx           (   12818) count:136    ave_time:11354  total_time:1544190
        xxxxx           (   12475) count:37     ave_time:4985   total_time:184452
        xxxxx           (   12487) count:36     ave_time:30318  total_time:1091483
        xxxxx           (   11147) count:12     ave_time:33489  total_time:401874
        xxxxx           (   11345) count:6      ave_time:584    total_time:3508
        xxxxx           (   12876) count:6      ave_time:483    total_time:2900
        xxxxx           (   11457) count:4      ave_time:345    total_time:1380

Any suggestions/tips are helpful.
Rgds

