
kvccl: re-order enterprise check in canSendToFollower #62465

Merged: craig[bot] merged 1 commit into cockroachdb:master from nvb:nvanbenschoten/followerReadEnterprise on Mar 23, 2021

Conversation

@nvb (Contributor) commented Mar 23, 2021

Fixes #62447.

In #62447, Erik found that #59571 had re-ordered the call to
`utilccl.CheckEnterpriseEnabled` to occur before checking the batch in
`canSendToFollower`, instead of after. This added an error allocation
into the hot path of all reads, which showed up in CPU profiles and
caused an 8% performance regression on `kv95`. This commit fixes this by
moving the enterprise check back out of the hot path for all non-stale
read-only batches.

A follow-up to this PR would be to make `utilccl.CheckEnterpriseEnabled`
cheaper by avoiding the error allocation for callers that don't need an
error. That work is not done in this commit.

@nvb nvb requested a review from erikgrinaker March 23, 2021 16:06
@cockroach-teamcity (Member)

This change is Reviewable

@nvb (Contributor, Author) commented Mar 23, 2021

I still need to run kv95 to demonstrate that this has fixed the perf regression.

@erikgrinaker (Contributor) left a comment

:lgtm: Good place to start; we can look into the enterprise check follow-up. Let me know if you'd like a hand with kv95 benchmarks as well.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@nvb (Contributor, Author) commented Mar 23, 2021

> Let me know if you'd like a hand with kv95 benchmarks as well.

All good, I can kick them off. Do you have a pointer to which kv95 roachtest you were running? `kv95/enc=false/nodes=1/cpu=32` on GCE?

@erikgrinaker (Contributor)

Yep. We usually do `--count 5 --cpu-quota 320` and pick the median.

@nvb (Contributor, Author) commented Mar 23, 2021

I'm not seeing much of a top-line improvement with this change when using benchstat to look at statistical significance over 10 runs, but the improvement is pretty clear from CPU profiles taken before and after:

name                           old ops/s    new ops/s    delta
kv95/enc=false/nodes=1/cpu=32   65.3k ± 4%   66.5k ± 5%  +1.86%  (p=0.094 n=9+9)

name                           old avg(ms)  new avg(ms)  delta
kv95/enc=false/nodes=1/cpu=32    1.00 ± 0%    1.00 ± 0%    ~     (all equal)

name                           old p99(ms)  new p99(ms)  delta
kv95/enc=false/nodes=1/cpu=32    5.37 ± 8%    5.08 ± 8%  -5.38%  (p=0.045 n=9+9)

Before: [CPU profile screenshot]

After: [CPU profile screenshot]

It's possible that the call to `(*hlc.Clock).Now` is more expensive than the call to `timeutil.Since`, and that this is also contributing to the regression we're seeing. I'll try mocking out the entire function and seeing how that performs.

@nvb (Contributor, Author) commented Mar 23, 2021

> I'll try mocking out the entire function and seeing how that performs.

I ran another 10 runs with the following patch, which makes closedTimestampLikelySufficient a no-op.

diff --git a/pkg/ccl/kvccl/kvfollowerreadsccl/followerreads.go b/pkg/ccl/kvccl/kvfollowerreadsccl/followerreads.go
index 8d4695a..eda6adc 100644
--- a/pkg/ccl/kvccl/kvfollowerreadsccl/followerreads.go
+++ b/pkg/ccl/kvccl/kvfollowerreadsccl/followerreads.go
@@ -102,6 +102,9 @@ func closedTimestampLikelySufficient(
        ctPolicy roachpb.RangeClosedTimestampPolicy,
        requiredFrontierTS hlc.Timestamp,
 ) bool {
+       if true {
+               return false
+       }
        var offset time.Duration
        switch ctPolicy {
        case roachpb.LAG_BY_CLUSTER_SETTING:

This resulted in the following change in performance:

name                           old ops/s    new ops/s    delta
kv95/enc=false/nodes=1/cpu=32   66.5k ± 5%   66.9k ± 3%    ~     (p=0.661 n=9+10)

name                           old avg(ms)  new avg(ms)  delta
kv95/enc=false/nodes=1/cpu=32    1.00 ± 0%    0.95 ± 5%  -5.00%  (p=0.007 n=7+10)

name                           old p99(ms)  new p99(ms)  delta
kv95/enc=false/nodes=1/cpu=32    5.08 ± 8%    5.00 ± 0%    ~     (p=0.286 n=9+6)

So while there are likely some wins to be had here, probably around exposing a version of `(*hlc.Clock).Now` that does not tick the HLC, that isn't responsible for the rest of the 8% perf hit referenced in #62447, so I don't think there's anything more to do here for now.

@erikgrinaker (Contributor)

Hm, if we're hitting this often enough for the clock to be significant here, I'd be even more worried about hitting this for users that do have a license:

// FIXME(tschottdorf): see whether it makes sense to cache the decoded
// license.
if str := enterpriseLicense.Get(&st.SV); str != "" {
	var err error
	if lic, err = decode(str); err != nil {
		return err
	}
}

Maybe we need to consider caching the enterprise settings?

@nvb (Contributor, Author) commented Mar 23, 2021

Agreed. With this PR, the clock is hit on the hot path for all requests while the enterprise license check is only hit for follower reads, rather than the other way around. That said, I agree we should make that check cheaper if it's going to be on the hot path of any type of request.

@nvb (Contributor, Author) commented Mar 23, 2021

I ran `kv95/enc=false/nodes=1/cpu=32` 10 times on 0ac8ab9 and then on its parent be115b9, and am seeing a small regression, but not on the same order of magnitude as the 8% drop cited in #62447.

name                           old ops/s    new ops/s    delta
kv95/enc=false/nodes=1/cpu=32   72.6k ± 1%   72.0k ± 2%  -0.83%  (p=0.089 n=10+10)

name                           old avg(ms)  new avg(ms)  delta
kv95/enc=false/nodes=1/cpu=32    0.90 ± 0%    0.90 ± 0%    ~     (all equal)

name                           old p99(ms)  new p99(ms)  delta
kv95/enc=false/nodes=1/cpu=32    4.61 ± 8%    4.82 ± 8%    ~     (p=0.109 n=10+10)

So I feel comfortable with merging this fix and calling it a day. Thanks again for finding this @erikgrinaker!

bors r+

@erikgrinaker (Contributor)

Thanks! I'll do a couple more runs as well to recheck the numbers -- kv95 does tend to be a bit unreliable.

@craig (bot) commented Mar 23, 2021

Build failed (retrying...):

@craig (bot) commented Mar 23, 2021

Build succeeded:

@craig craig bot merged commit 822187c into cockroachdb:master Mar 23, 2021
@erikgrinaker (Contributor)

Opened #62489 for CheckEnterpriseEnabled optimizations.

@erikgrinaker (Contributor)

I see significant gains with this patch, going from 61656 ops/s before to 65841 ops/s after on kv95/enc=false/nodes=1/cpu=32. This is consistent with the original drop in performance.



Successfully merging this pull request may close these issues.

kvfollowerreadsccl: excessive overhead in canSendToFollower
