
sql: introduce BenchmarkWriter benchmark#76244

Merged
craig[bot] merged 1 commit into cockroachdb:master from Azhng:txn-id-writer-bench
Feb 16, 2022

Conversation

Contributor

@Azhng Azhng commented Feb 8, 2022

This commit introduces the new BenchmarkWriter benchmark to measure the
write performance for TxnID Cache.

Informs #76013

Release note: None

@Azhng Azhng requested review from a team and ajwerner February 8, 2022 17:49
@cockroach-teamcity
Member

This change is Reviewable

Comment on lines +75 to +79
for _, sinkType := range []string{"blackHoleSink", "realSink"} {
b.Run(fmt.Sprintf("sinkType=%s", sinkType), func(b *testing.B) {
for _, numOfConcurrentWriter := range []int{1, 24, 48, 64, 92, 128} {
b.Run(fmt.Sprintf("concurrentWriter=%d", numOfConcurrentWriter), func(b *testing.B) {
starter := make(chan struct{})
Contributor

@ajwerner ajwerner Feb 8, 2022

nit: I think you can clean it up a bit

run := func(b *testing.B, sink messageSink) {
    w := newWriter(sink, blockPool)
    starter := make(chan struct{})
    b.SetBytes(messageBlockSize * entrySize)
    var wg sync.WaitGroup
    for writerIdx := 0; writerIdx < numOfConcurrentWriter; writerIdx++ {
        // ...
    }
    b.ResetTimer()
    close(starter)
    wg.Wait()
}
for _, sinkType := range []struct {
    name string
    new  func() (_ messageSink, cleanup func())
}{
    {
        name: "blackHole",
        new: func() (messageSink, func()) {
            blackHole := newBlackHoleSink(blockPool, channelSize)
            blackHole.start()
            return blackHole, blackHole.stop
        },
    },
    // ...
} {
    // ...
}

Contributor Author

Done.


<-starter

for i := 0; i < b.N; i++ {
Contributor

it's probably interesting to divide b.N by numOfConcurrentWriter
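The suggestion can be sketched as follows; `runConcurrent` and its `write` callback are hypothetical stand-ins for illustration, not the benchmark's actual code. Dividing `b.N` across the writers keeps the total work at roughly `b.N` operations instead of `b.N` per writer:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"testing"
)

// runConcurrent divides b.N across the writers so the benchmark still
// performs roughly b.N writes in total, rather than b.N per writer.
func runConcurrent(b *testing.B, numWriters int, write func()) {
	starter := make(chan struct{})
	opsPerWriter := b.N / numWriters
	var wg sync.WaitGroup
	for i := 0; i < numWriters; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-starter // hold every writer until all goroutines are spawned
			for j := 0; j < opsPerWriter; j++ {
				write()
			}
		}()
	}
	b.ResetTimer() // exclude goroutine setup from the measurement
	close(starter) // release all writers at once
	wg.Wait()
}

func main() {
	// testing.Benchmark lets the sketch run outside `go test`.
	var ops int64
	res := testing.Benchmark(func(b *testing.B) {
		runConcurrent(b, 4, func() { atomic.AddInt64(&ops, 1) })
	})
	fmt.Println("benchmark iterations:", res.N)
}
```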

Contributor Author

Done.

The updated benchmark number

$ benchstat ~/tmp/blackhole ~/tmp/real
name                            old time/op    new time/op    delta
Writer/concurrentWriter=1-24      72.2ns ± 0%   788.1ns ± 6%  +991.61%  (p=0.000 n=9+10)
Writer/concurrentWriter=24-24      117ns ±11%     227ns ± 9%   +94.19%  (p=0.000 n=10+10)
Writer/concurrentWriter=48-24      123ns ±16%     176ns ± 9%   +43.10%  (p=0.000 n=10+9)
Writer/concurrentWriter=64-24      122ns ± 2%     181ns ± 2%   +47.90%  (p=0.000 n=8+8)
Writer/concurrentWriter=92-24      142ns ± 8%     164ns ± 4%   +15.06%  (p=0.000 n=10+10)
Writer/concurrentWriter=128-24     160ns ± 7%     174ns ± 5%    +8.74%  (p=0.019 n=10+10)

name                            old speed      new speed      delta
Writer/concurrentWriter=1-24     340GB/s ± 0%    31GB/s ± 6%   -90.83%  (p=0.000 n=9+10)
Writer/concurrentWriter=24-24    211GB/s ±10%   108GB/s ± 9%   -48.61%  (p=0.000 n=10+10)
Writer/concurrentWriter=48-24    201GB/s ±14%   140GB/s ± 9%   -30.47%  (p=0.000 n=10+9)
Writer/concurrentWriter=64-24    201GB/s ± 2%   136GB/s ± 2%   -32.39%  (p=0.000 n=8+8)
Writer/concurrentWriter=92-24    173GB/s ± 8%   150GB/s ± 4%   -13.38%  (p=0.000 n=10+10)
Writer/concurrentWriter=128-24   154GB/s ± 7%   142GB/s ± 5%    -8.15%  (p=0.019 n=10+10)

name                            old alloc/op   new alloc/op   delta
Writer/concurrentWriter=1-24       0.00B        219.00B ± 3%     +Inf%  (p=0.000 n=10+9)
Writer/concurrentWriter=24-24      0.00B         22.60B ± 3%     +Inf%  (p=0.000 n=10+10)
Writer/concurrentWriter=48-24      0.00B         19.00B ± 0%     +Inf%  (p=0.000 n=10+10)
Writer/concurrentWriter=64-24      0.00B         19.00B ± 0%     +Inf%  (p=0.000 n=10+10)
Writer/concurrentWriter=92-24      0.00B         17.60B ± 3%     +Inf%  (p=0.000 n=10+10)
Writer/concurrentWriter=128-24     0.00B         17.30B ± 4%     +Inf%  (p=0.000 n=10+10)

name                            old allocs/op  new allocs/op  delta
Writer/concurrentWriter=1-24        0.00           1.00 ± 0%     +Inf%  (p=0.000 n=10+10)
Writer/concurrentWriter=24-24       0.00           1.00 ± 0%     +Inf%  (p=0.000 n=10+10)
Writer/concurrentWriter=48-24       0.00           1.00 ± 0%     +Inf%  (p=0.000 n=10+10)
Writer/concurrentWriter=64-24       0.00           0.00           ~     (all equal)
Writer/concurrentWriter=92-24       0.00           0.00           ~     (all equal)
Writer/concurrentWriter=128-24      0.00           0.00           ~     (all equal)

Contributor

I feel like this benchmark is not capturing the reality of what's going on in the kv benchmark. 1 extra microsecond of computation time would not lead to the 8% drop in kv0 throughput we're seeing. I think it's worth keeping in mind the ways in which microbenchmarks don't tell the full picture.

Did you try running the kv0 benchmark on a cluster and looking at profiles?

Contributor Author

CPU Profile

(screenshot)

Memory Profile

(screenshot)

Heap Profile

(screenshot)

I'm not too sure how to interpret the graphs here. It seems like a lot of time is spent in cache.UnorderedCache performing eviction, and there are some allocations performed in cache.UnorderedCache.

I'm not sure whether all the small allocations made by cache.UnorderedCache cause extra GC pressure. My reasoning here is that the kv workload is pushing nearly 20k QPS, and since we perform 2 writes per operation, we are performing 40k writes per second. That is a lot of small objects allocated on the heap (for the linked list used to perform FIFO eviction).

Contributor Author

I did another kv95 run after turning txnidcache.Cache into effectively a no-op sink. It performs on par with the baseline from before the TxnID cache was introduced.

Branch           elapsed   errors  ops(total)  ops/sec(cum)  avg(ms)  p50(ms)  p95(ms)  p99(ms)  pMax(ms)
baseline         2700.0s   0       55785736    20661.4       3.1      2.4      8.4      13.1     75.5
cache-normal     2700.0s   0       49867068    18469.3       3.5      2.5      10.0     17.8     671.1
cache-noop-sink  2700.0s   0       56440227    20903.8       3.1      2.4      8.1      13.1     71.3

I think at this point it's reasonable to believe that the slowdown comes from the sink, which couldn't keep up with the incoming writes, causing the channel to fill up, which resulted in back pressure and caused the concurrentWriteBuffer to wait.
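The back pressure described here is just Go's buffered-channel semantics; a minimal demonstration (the channel size and `tryWrite` helper are hypothetical, not the real sink's):

```go
package main

import "fmt"

// tryWrite models a writer flushing a block to the sink's channel. A
// plain `ch <- blk` would block once the buffer fills, which is the
// back pressure described above; select/default makes the stall visible.
func tryWrite(ch chan int, blk int) bool {
	select {
	case ch <- blk:
		return true
	default:
		return false // channel full: an unconditional send would stall here
	}
}

func main() {
	// A slow sink: nothing drains the channel in this sketch.
	sink := make(chan int, 2)
	for i := 0; i < 4; i++ {
		fmt.Printf("write %d accepted: %v\n", i, tryWrite(sink, i))
	}
}
```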

I think at this point there are a few options:

  1. have multiple cache.UnorderedCache instances, each with its own goroutine / channel.
  2. try out using a ring buffer. This might introduce slightly slower reads, but since reads don't happen during the query execution hot path, I think it's an acceptable trade-off.
  3. maybe both?

I will try out these ideas tomorrow.
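Option 2 could look roughly like the following; this is a hypothetical sketch of a fixed-capacity, allocation-free FIFO, not the implementation that eventually landed:

```go
package main

import "fmt"

// record is a stand-in for a (TxnID, TxnFingerprintID) pair.
type record struct {
	txnID         uint64
	fingerprintID uint64
}

// ringCache is a fixed-capacity FIFO: once full, each write overwrites
// the oldest entry in place, so the steady-state write path performs
// zero heap allocations.
type ringCache struct {
	buf  []record
	head int // index of the next slot to overwrite
	full bool
}

func newRingCache(capacity int) *ringCache {
	return &ringCache{buf: make([]record, capacity)}
}

func (c *ringCache) add(r record) {
	c.buf[c.head] = r
	c.head++
	if c.head == len(c.buf) {
		c.head = 0
		c.full = true
	}
}

// lookup scans the buffer; reads are O(n), which is the trade-off noted
// above: acceptable because reads are off the query execution hot path.
func (c *ringCache) lookup(txnID uint64) (uint64, bool) {
	n := c.head
	if c.full {
		n = len(c.buf)
	}
	for i := 0; i < n; i++ {
		if c.buf[i].txnID == txnID {
			return c.buf[i].fingerprintID, true
		}
	}
	return 0, false
}

func main() {
	c := newRingCache(2)
	c.add(record{txnID: 1, fingerprintID: 10})
	c.add(record{txnID: 2, fingerprintID: 20})
	c.add(record{txnID: 3, fingerprintID: 30}) // evicts txnID=1
	_, ok := c.lookup(1)
	fmt.Println("txnID 1 evicted:", !ok)
}
```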

Contributor Author

Here is the diff I introduced:

diff --git a/pkg/sql/contention/txnidcache/txn_id_cache.go b/pkg/sql/contention/txnidcache/txn_id_cache.go
index 9384ed7044..e335ec6e39 100644
--- a/pkg/sql/contention/txnidcache/txn_id_cache.go
+++ b/pkg/sql/contention/txnidcache/txn_id_cache.go
@@ -171,14 +171,15 @@ func NewTxnIDCache(st *cluster.Settings, metrics *Metrics) *Cache {
 // Start implements the Provider interface.
 func (t *Cache) Start(ctx context.Context, stopper *stop.Stopper) {
 	addBlockToStore := func(msgBlock *messageBlock) {
-		t.mu.Lock()
-		defer t.mu.Unlock()
-		for blockIdx := range msgBlock {
-			if !msgBlock[blockIdx].valid() {
-				break
-			}
-			t.mu.store.Add(msgBlock[blockIdx].TxnID, msgBlock[blockIdx].TxnFingerprintID)
-		}
+		// TODO(azhng): wip: imagine if we are using a blackhoel sink
+		// t.mu.Lock()
+		// defer t.mu.Unlock()
+		// for blockIdx := range msgBlock {
+		// 	if !msgBlock[blockIdx].valid() {
+		// 		break
+		// 	}
+		// 	t.mu.store.Add(msgBlock[blockIdx].TxnID, msgBlock[blockIdx].TxnFingerprintID)
+		// }
 	}
 
 	consumeBlock := func(b *messageBlock) {

@Azhng Azhng force-pushed the txn-id-writer-bench branch from 9a32847 to 193c423 on February 8, 2022 19:59
Contributor

@ajwerner ajwerner left a comment

LGTM

Comment on lines +111 to +126
name: "blackHoleSink",
new: func() (_ messageSink, _ *sync.Pool, cleanup func()) {
blockPool := &sync.Pool{
New: func() interface{} {
return &messageBlock{}
},
}

blackHole := newBlackHoleSink(blockPool, channelSize)
blackHole.start()

return blackHole, blockPool, blackHole.stop
},
},
{
name: "realSink",
Contributor

nit: you don't need the word Sink in these names, you already have sink=

Contributor

@maryliag maryliag left a comment

just one tiny nit
:lgtm:

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @Azhng)


pkg/sql/contention/txnidcache/writer_test.go, line 30 at r2 (raw file):

	pool *sync.Pool

	// Simulate a real sink

nit: period

@Azhng Azhng force-pushed the txn-id-writer-bench branch 2 times, most recently from 7857ad0 to edbbefb on February 9, 2022 21:54
Contributor Author

Azhng commented Feb 9, 2022

Chatted with @ajwerner offline; updated the benchmark to populate the TxnFingerprintID field in order to defeat Golang's small-integer optimization. However, instead of pre-populating a slice of randomly generated inputs, I just use the index value to generate unique input. This is because I noticed that, when the same value is inserted more than once, subsequent insertions might not create an allocation (which is unrealistic in the actual workload).
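The small-integer behavior referred to here can be demonstrated directly; a minimal illustration (`boxAllocs` is a hypothetical helper) of the Go runtime's static table for boxed integer values below 256:

```go
package main

import (
	"fmt"
	"testing"
)

var sink interface{} // package-level so the conversion must escape

// boxAllocs reports how many heap allocations it takes to box v into
// an interface. The Go runtime keeps a static table for values 0-255,
// so boxing small integers allocates nothing; this is how repeated
// inserts of the same small value can hide allocations in a benchmark.
func boxAllocs(v uint64) float64 {
	return testing.AllocsPerRun(1000, func() {
		sink = v
	})
}

func main() {
	fmt.Println("small value allocs:", boxAllocs(7))
	fmt.Println("large value allocs:", boxAllocs(1<<40))
}
```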

The new benchmark yields a significantly bigger drop in throughput when the real sink is used.

$ benchstat ~/tmp/blackhole ~/tmp/real
name                            old time/op    new time/op    delta
Writer/concurrentWriter=1-24      76.7ns ± 2%   786.8ns ± 7%  +925.35%  (p=0.000 n=10+10)
Writer/concurrentWriter=24-24      109ns ± 1%     760ns ±15%  +596.05%  (p=0.000 n=8+10)
Writer/concurrentWriter=48-24      116ns ± 3%     735ns ±12%  +535.04%  (p=0.000 n=8+10)
Writer/concurrentWriter=64-24      121ns ± 3%     775ns ± 8%  +538.66%  (p=0.000 n=8+10)
Writer/concurrentWriter=92-24      136ns ± 5%     790ns ± 4%  +480.75%  (p=0.000 n=9+10)
Writer/concurrentWriter=128-24     161ns ±12%     782ns ± 6%  +384.94%  (p=0.000 n=10+10)

name                            old speed      new speed      delta
Writer/concurrentWriter=1-24     320GB/s ± 2%    31GB/s ± 7%   -90.23%  (p=0.000 n=10+10)
Writer/concurrentWriter=24-24    225GB/s ± 1%    33GB/s ±17%   -85.53%  (p=0.000 n=8+10)
Writer/concurrentWriter=48-24    212GB/s ± 3%    34GB/s ±13%   -84.18%  (p=0.000 n=8+10)
Writer/concurrentWriter=64-24    203GB/s ± 2%    32GB/s ± 7%   -84.32%  (p=0.000 n=8+10)
Writer/concurrentWriter=92-24    181GB/s ± 5%    31GB/s ± 3%   -82.78%  (p=0.000 n=9+10)
Writer/concurrentWriter=128-24   153GB/s ±11%    31GB/s ± 6%   -79.43%  (p=0.000 n=10+10)

name                            old alloc/op   new alloc/op   delta
Writer/concurrentWriter=1-24       0.00B        238.67B ± 2%     +Inf%  (p=0.000 n=10+9)
Writer/concurrentWriter=24-24      0.00B        239.12B ± 2%     +Inf%  (p=0.000 n=10+8)
Writer/concurrentWriter=48-24      0.00B        241.88B ± 2%     +Inf%  (p=0.000 n=10+8)
Writer/concurrentWriter=64-24      0.00B        239.20B ± 2%     +Inf%  (p=0.000 n=10+10)
Writer/concurrentWriter=92-24      0.00B        238.00B ± 1%     +Inf%  (p=0.000 n=10+10)
Writer/concurrentWriter=128-24     0.00B        237.90B ± 2%     +Inf%  (p=0.000 n=10+10)

name                            old allocs/op  new allocs/op  delta
Writer/concurrentWriter=1-24        0.00           2.00 ± 0%     +Inf%  (p=0.000 n=10+10)
Writer/concurrentWriter=24-24       0.00           2.00 ± 0%     +Inf%  (p=0.000 n=10+10)
Writer/concurrentWriter=48-24       0.00           2.00 ± 0%     +Inf%  (p=0.000 n=10+10)
Writer/concurrentWriter=64-24       0.00           2.00 ± 0%     +Inf%  (p=0.000 n=10+10)
Writer/concurrentWriter=92-24       0.00           2.00 ± 0%     +Inf%  (p=0.000 n=10+10)
Writer/concurrentWriter=128-24      0.00           2.00 ± 0%     +Inf%  (p=0.000 n=10+10)

@Azhng Azhng force-pushed the txn-id-writer-bench branch from edbbefb to 00fd06a on February 14, 2022 16:39
craig bot pushed a commit that referenced this pull request Feb 16, 2022
76350: sql: introduce new FIFO cache for txnID cache r=Azhng a=Azhng

Reviewer note: first commit belongs to #76244

---

Previously, txnID cache relied on cache.UnorderedCache for storage and
FIFO eviction behavior. However, due to unintended allocation behavior
within cache.UnorderedCache, this yielded about 20% throughput drop in
kv95 benchmark.

This commit introduces a new FIFO cache, built to remove all allocation
during write operation. This commit removes all performance hit caused
by the previous use of cache.UnorderedCache.

Resolves #76013

Release note: None

76590: sql/catalog,migrations: propagate HasPostDeserializationChanges, simplify migrations r=ajwerner a=ajwerner

This is an imperfect change that feels a bit ad-hoc for my liking, but does
improve the situation. The problem is that in recent changes, we no longer
just retrieve mutable descriptors and then sometimes clone Immutable
descriptors. Instead we create the more general Immutable descriptors and
clone out mutables. The problem is that when we did this, we threw away
information regarding whether the descriptor itself had been modified
by post-deserialization. In general, this is mostly an exercise in plumbing.
The key feature is that when we retrieve a Mutable descriptor and it was
changed, we need to know that. This change tests that.

There is some ad-hoc code to propagate isUncommitedVersion in various places
which I don't feel great about, but it also felt wrong not to. A follow-up
should come through to really nail down the properties here.

The existence of NewBuilder and the fact that it's called in various places
is another mechanism by which this information could be erased, but that's
less of a concern for me. This change makes it possible to simplify migrations
around descriptor upgrades.

### migrations: simplify descriptor migrations

We have two migrations which have the goal of iterating through the descriptors
and re-writing them if there are any changes due to the post-deserialization
logic. They were needlessly complex and low-level. Probably part of the reason
was that we were swallowing this information. This commit reworks them to use
shared library code.

Probably one of them could be eliminated altogether.

Release note: None

Co-authored-by: Azhng <archer.xn@gmail.com>
Co-authored-by: Andrew Werner <awerner32@gmail.com>
@craig craig bot merged commit 00fd06a into cockroachdb:master Feb 16, 2022