[slurm-users] Extremely sluggish squeue -p partition

Bas van der Vlies bas.vandervlies at surf.nl
Tue Dec 8 08:54:21 UTC 2020


Just a question do you use allowgroups for partition access? we had similar problems that squeue or other slurm commands hang when slurm fetch the groups form LDAP for
each partition also the one it already fetched. So we configured `nscd` to prevent this.


--
Bas van der Vlies
| Operations, Support & Development | SURFsara | Science Park 140 | 1098 XG  Amsterdam
| T +31 (0) 20 800 1300  | bas.vandervlies at surf.nl | www.surf.nl |




> On 8 Dec 2020, at 05:43, Williams, Jenny Avis <jennyw at email.unc.edu> wrote:
> 
> I have an interesting condition that has been going on for a few days that could use the feedback of those more familiar with the way slurm works under the hood.
> Conditions :
> Slurm v20.02.3 The cluster is relatively quiet given the time of year, and the commands are running on the host on which slurmctld is running, so it should not be exiting the host. Sdiag output included at the end.
>  
> Problem:
> Queries done on partitions by either squeue or sacct are very slow, even squeue on an empty partition !!  with the following exceptions:
>  sacct queries for running jobs
>   squeue on jobID
>   squeue on user
>  
> Question:
> What is distinct in the case of squeue on partitions as compared to the sacct on running jobs, or the squeue on user examples, that might account for this behavior?
>  
> #  setup
> # Slow cases
> squeue_running="squeue -p general -t running|wc -l"
> squeue_pending="squeue -p general -t pending|wc -l"
> sacct_pending="sacct -r general -s pending|wc -l"
> empty_squeue="squeue -p staff |wc -l"
>  
>  
>  
> # Fast cases
> sacct_running="sacct -r general -s running|wc -l"
> squeue_user="squeue -u auser |wc -l"
>  
>  
>  
> # time eval $squeue_pending
> 17642
>  
> real    0m30.771s
> user    0m0.392s
> sys     0m0.835s
>  
> # time eval $squeue_running
> 5175
>  
> real    0m31.137s
> user    0m0.304s
> sys     0m0.831s
>  
> # time eval $empty_squeue
> 1
>  
> real    0m31.563s
> user    0m0.279s
> sys     0m0.844s
>  
> # time eval $sacct_pending
> 17529
>  
> real    0m25.901s
> user    0m0.332s
> sys     0m0.175s
>  
>  
>  
> #  time eval $sacct_running
> 15541
>  
> real    0m1.135s
> user    0m0.149s
> sys     0m0.062s
>  
> # time eval $squeue_user 
> 3086
>  
> real    0m0.422s
> user    0m0.029s
> sys     0m0.031s
>  
> # time sacct -r staff
>        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
> ------------ ---------- ---------- ---------- ---------- ---------- --------
>  
> real    0m0.209s
> user    0m0.007s
> sys     0m0.002s
>  
>  
>  
>  
>  
>  
>  
> *******************************************************
> sdiag output at Mon Dec 07 16:25:37 2020 (1607376337)
> Data since      Mon Dec 07 15:09:07 2020 (1607371747)
> *******************************************************
> Server thread count:  16
> Agent queue size:     0
> Agent count:          0
> Agent thread count:   0
> DBD Agent queue size: 0
>  
> Jobs submitted: 3550
> Jobs started:   5064
> Jobs completed: 4683
> Jobs canceled:  83
> Jobs failed:    0
>  
> Job states ts:  Mon Dec 07 16:24:53 2020 (1607376293)
> Jobs pending:   39099
> Jobs running:   5915
>  
> Main schedule statistics (microseconds):
>         Last cycle:   96597
>         Max cycle:    1992372
>         Total cycles: 338
>         Mean cycle:   404511
>         Mean depth cycle:  1195
>         Cycles per minute: 4
>         Last queue length: 31916
>  
> Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
>         Total backfilled jobs (since last slurm start): 4610
>         Total backfilled jobs (since last stats cycle start): 4610
>         Total backfilled heterogeneous job components: 0
>         Total cycles: 71
>         Last cycle when: Mon Dec 07 16:24:25 2020 (1607376265)
>         Last cycle: 23893637
>         Max cycle:  43863303
>         Mean cycle: 22141327
>         Last depth cycle: 22840
>         Last depth cycle (try sched): 1318
>         Depth Mean: 23541
>         Depth Mean (try depth): 1299
>         Last queue length: 31917
>         Queue length mean: 31864
>         Last table size: 164
>         Mean table size: 114
>  
> Latency for 1000 calls to gettimeofday(): 19 microseconds
>  
> Remote Procedure Call statistics by message type
>         REQUEST_PARTITION_INFO                  ( 2009) count:16568  ave_time:314    total_time:5202676
>         REQUEST_FED_INFO                        ( 2049) count:7822   ave_time:194    total_time:1520744
>         REQUEST_JOB_USER_INFO                   ( 2039) count:7756   ave_time:70747  total_time:548714078
>         REQUEST_JOB_INFO_SINGLE                 ( 2021) count:5626   ave_time:119832 total_time:674179337
>         REQUEST_COMPLETE_PROLOG                 ( 6018) count:5068   ave_time:2126092 total_time:10775038166
>         MESSAGE_EPILOG_COMPLETE                 ( 6012) count:4774   ave_time:1044194 total_time:4984985393
>         REQUEST_STEP_COMPLETE                   ( 5016) count:4770   ave_time:1149196 total_time:5481666326
>         REQUEST_COMPLETE_BATCH_SCRIPT           ( 5018) count:4746   ave_time:4107848 total_time:19495849569
>         REQUEST_SUBMIT_BATCH_JOB                ( 4003) count:3180   ave_time:893436 total_time:2841126933
>         REQUEST_NODE_INFO                       ( 2007) count:2994   ave_time:61229  total_time:183321804
>         MESSAGE_NODE_REGISTRATION_STATUS        ( 1002) count:369    ave_time:365319 total_time:134802881
>         REQUEST_JOB_INFO                        ( 2003) count:219    ave_time:5586738 total_time:1223495821
>         REQUEST_KILL_JOB                        ( 5032) count:83     ave_time:3698948 total_time:307012762
>         REQUEST_PRIORITY_FACTORS                ( 2026) count:48     ave_time:302008 total_time:14496396
>         REQUEST_UPDATE_NODE                     ( 3002) count:30     ave_time:234937 total_time:7048123
>         REQUEST_BUILD_INFO                      ( 2001) count:27     ave_time:102471 total_time:2766731
>         REQUEST_JOB_READY                       ( 4019) count:23     ave_time:128717 total_time:2960508
>         REQUEST_PING                            ( 1008) count:15     ave_time:139    total_time:2088
>         REQUEST_CANCEL_JOB_STEP                 ( 5005) count:8      ave_time:1320084 total_time:10560677
>         REQUEST_COMPLETE_JOB_ALLOCATION         ( 5017) count:7      ave_time:2417071 total_time:16919499
>         REQUEST_NODE_INFO_SINGLE                ( 2040) count:6      ave_time:189125 total_time:1134750
>         REQUEST_RESOURCE_ALLOCATION             ( 4001) count:6      ave_time:4771306 total_time:28627836
>         REQUEST_UPDATE_PARTITION                ( 3005) count:6      ave_time:633112 total_time:3798672
>         REQUEST_JOB_STEP_CREATE                 ( 5001) count:5      ave_time:1015017 total_time:5075087
>         REQUEST_STATS_INFO                      ( 2035) count:0      ave_time:0      total_time:0
> ..
> Pending RPC statistics
>         No pending RPCs

-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 4182 bytes
Desc: not available
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20201208/f5827c7d/attachment.bin>


More information about the slurm-users mailing list