
sql: introduce new FIFO cache for txnID cache #76350

Merged
craig[bot] merged 2 commits into cockroachdb:master from Azhng:txn-id-faster-cache
Feb 16, 2022

Conversation

@Azhng
Contributor

@Azhng Azhng commented Feb 10, 2022

Reviewer note: first commit belongs to #76244


Previously, the txnID cache relied on cache.UnorderedCache for storage and
FIFO eviction behavior. However, due to unintended allocation behavior
within cache.UnorderedCache, this caused a roughly 20% throughput drop in
the kv95 benchmark.

This commit introduces a new FIFO cache built to remove all allocation
during write operations, eliminating the performance hit caused by the
previous use of cache.UnorderedCache.

Resolves #76013

Release note: None
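
The design discussed in this PR can be sketched roughly as follows. This is an illustrative Go sketch, not the actual CockroachDB implementation: all names (resolvedTxnID, block, fifoCache) and the slice-backed FIFO are simplifications. The key idea is that writers fill fixed-size value-type blocks and the cache evicts whole blocks in oldest-first order, so filling a block itself performs no per-entry heap allocation.

```go
package main

import "fmt"

const blockSize = 4 // tiny for demonstration; the review below settles on 168

type resolvedTxnID struct {
	txnID       uint64
	fingerprint uint64
}

// block is a fixed-size array held by value, so filling it allocates nothing.
type block [blockSize]resolvedTxnID

// fifoCache keeps at most capacity entries, evicting the oldest block first.
type fifoCache struct {
	capacity int
	blocks   []block // FIFO order: blocks[0] is the oldest
	size     int
}

func (c *fifoCache) add(b block, n int) {
	c.blocks = append(c.blocks, b)
	c.size += n
	// Evict whole blocks from the front until we are back within capacity.
	for c.size > c.capacity && len(c.blocks) > 0 {
		c.size -= blockSize
		c.blocks = c.blocks[1:]
	}
}

func main() {
	c := &fifoCache{capacity: 8}
	var b block
	for i := 0; i < 4; i++ {
		for j := 0; j < blockSize; j++ {
			b[j] = resolvedTxnID{txnID: uint64(i*blockSize + j)}
		}
		c.add(b, blockSize)
	}
	fmt.Println(len(c.blocks), c.size) // two full blocks retained, 8 entries
}
```

The real PR additionally pools the list nodes (see the sync.Pool discussion below) so the write path stays allocation-free in steady state.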

@cockroach-teamcity
Member

This change is Reviewable

@Azhng
Contributor Author

Azhng commented Feb 10, 2022

kv95 benchmark:

Branch elapsed ops(total) ops/sec(cum) avg(ms) p50(ms) p95(ms) p99(ms) pMax(ms)
no-cache 2700.0s 55785736 20661.4 3.1 2.4 8.4 13.1 75.5
old-cache 2700.0s 49867068 18469.3 3.5 2.5 10.0 17.8 671.1
new-cache 2700.0s 60666291 22469.0 2.8 2.2 7.6 11.0 75.5

Microbenchmark:

$ benchstat ~/tmp/BenchmarkWriter-master/bench ~/tmp/BenchmarkWriter-typed-cache/bench | grep -v 'blackHole'                                                   [18:27:15]
name                                               old time/op    new time/op    delta
Writer/sinkType=real/concurrentWriter=1-24            787ns ± 7%     363ns ± 1%   -53.87%  (p=0.000 n=10+10)
Writer/sinkType=real/concurrentWriter=24-24           760ns ±15%     366ns ± 2%   -51.80%  (p=0.000 n=10+10)
Writer/sinkType=real/concurrentWriter=48-24           735ns ±12%     364ns ± 1%   -50.49%  (p=0.000 n=10+9)
Writer/sinkType=real/concurrentWriter=64-24           775ns ± 8%     360ns ± 2%   -53.51%  (p=0.000 n=10+9)
Writer/sinkType=real/concurrentWriter=92-24           790ns ± 4%     358ns ± 5%   -54.63%  (p=0.000 n=10+10)
Writer/sinkType=real/concurrentWriter=128-24          782ns ± 6%     356ns ± 3%   -54.47%  (p=0.000 n=10+10)

name                                               old speed      new speed      delta
Writer/sinkType=real/concurrentWriter=1-24         31.3GB/s ± 7%  67.7GB/s ± 1%  +116.40%  (p=0.000 n=10+10)
Writer/sinkType=real/concurrentWriter=24-24        32.6GB/s ±17%  67.1GB/s ± 2%  +105.99%  (p=0.000 n=10+10)
Writer/sinkType=real/concurrentWriter=48-24        33.6GB/s ±13%  67.6GB/s ± 1%  +101.01%  (p=0.000 n=10+9)
Writer/sinkType=real/concurrentWriter=64-24        31.8GB/s ± 7%  68.2GB/s ± 2%  +114.76%  (p=0.000 n=10+9)
Writer/sinkType=real/concurrentWriter=92-24        31.1GB/s ± 3%  68.6GB/s ± 5%  +120.46%  (p=0.000 n=10+10)
Writer/sinkType=real/concurrentWriter=128-24       31.5GB/s ± 6%  69.0GB/s ± 3%  +119.46%  (p=0.000 n=10+10)

name                                               old alloc/op   new alloc/op   delta
Writer/sinkType=real/concurrentWriter=1-24             239B ± 2%       71B ± 2%   -70.44%  (p=0.000 n=9+9)
Writer/sinkType=real/concurrentWriter=24-24            239B ± 2%       74B ± 6%   -69.18%  (p=0.000 n=8+10)
Writer/sinkType=real/concurrentWriter=48-24            242B ± 2%       73B ± 4%   -69.82%  (p=0.000 n=8+10)
Writer/sinkType=real/concurrentWriter=64-24            239B ± 2%       70B ± 5%   -70.82%  (p=0.000 n=10+10)
Writer/sinkType=real/concurrentWriter=92-24            238B ± 1%       70B ± 3%   -70.68%  (p=0.000 n=10+9)
Writer/sinkType=real/concurrentWriter=128-24           238B ± 2%       69B ± 6%   -70.95%  (p=0.000 n=10+10)

name                                               old allocs/op  new allocs/op  delta
Writer/sinkType=real/concurrentWriter=1-24             2.00 ± 0%      0.00       -100.00%  (p=0.000 n=10+10)
Writer/sinkType=real/concurrentWriter=24-24            2.00 ± 0%      0.00       -100.00%  (p=0.000 n=10+10)
Writer/sinkType=real/concurrentWriter=48-24            2.00 ± 0%      0.00       -100.00%  (p=0.000 n=10+10)
Writer/sinkType=real/concurrentWriter=64-24            2.00 ± 0%      0.00       -100.00%  (p=0.000 n=10+10)
Writer/sinkType=real/concurrentWriter=92-24            2.00 ± 0%      0.00       -100.00%  (p=0.000 n=10+10)
Writer/sinkType=real/concurrentWriter=128-24           2.00 ± 0%      0.00       -100.00%  (p=0.000 n=10+10)

@Azhng Azhng force-pushed the txn-id-faster-cache branch from 972dee5 to 30c10c9 Compare February 10, 2022 00:29
@Azhng Azhng marked this pull request as ready for review February 10, 2022 01:42
@Azhng Azhng requested review from a team and ajwerner February 10, 2022 02:00
Contributor

@ajwerner ajwerner left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @Azhng)


pkg/sql/contention/txnidcache/fifo_cache.go, line 68 at r2 (raw file):

	}

	c.mu.eviction.insertBlock(block)

I worry about cases where you flush due to a timer as opposed to because the blocks are full and then you end up with mostly empty blocks in the cache. Perhaps it'd be better to have evictionNode have its own embedded messageBlock (as opposed to *messageBlock) and copy out of the block in add and into the current tail, filling up blocks in the FIFO as you go. Then you know you'll only ever have at most one non-full block. Then, sync.Pool the evictionNode.

@Azhng Azhng force-pushed the txn-id-faster-cache branch from 30c10c9 to db5eb95 Compare February 10, 2022 22:40
Contributor Author

@Azhng Azhng left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/sql/contention/txnidcache/fifo_cache.go, line 68 at r2 (raw file):

Previously, ajwerner wrote…

I worry about cases where you flush due to a timer as opposed to because the blocks are full and then you end up with mostly empty blocks in the cache. Perhaps it'd be better to have evictionNode have its own embedded messageBlock (as opposed to *messageBlock) and copy out of the block in add and into the current tail, filling up blocks in the FIFO as you go. Then you know you'll only ever have at most one non-full block. Then, sync.Pool the evictionNode.

I went down the route of directly hijacking the blocks and using them in the eviction list (mostly because I misunderstood your comment when I first read it).

Thoughts on this approach?

@Azhng Azhng force-pushed the txn-id-faster-cache branch from db5eb95 to c5af5a4 Compare February 10, 2022 22:46
@Azhng
Contributor Author

Azhng commented Feb 10, 2022

Compared to the previous approach (that risked having many empty blocks) I think the new overhead here is about 7% in the microbenchmark. I will follow up with the kv95 result once it's ready.

azhng@crlMBP-C02F814DMD6TMTA1: ~/go/src/github.com/cockroachdb/cockroach txn-id-faster-cache ⚡
$ benchstat ~/tmp/BenchmarkWriter-typed-cache/bench ~/tmp/BenchmarkWriter-compact-evict-list/bench | grep -v 'blackHole'                                     [17:49:23]
name                                               old time/op    new time/op    delta
Writer/sinkType=real/concurrentWriter=1-24            363ns ± 1%     389ns ± 3%  +7.20%  (p=0.000 n=10+9)
Writer/sinkType=real/concurrentWriter=24-24           366ns ± 2%     393ns ± 3%  +7.39%  (p=0.000 n=10+9)
Writer/sinkType=real/concurrentWriter=48-24           364ns ± 1%     388ns ± 3%  +6.64%  (p=0.000 n=9+9)
Writer/sinkType=real/concurrentWriter=64-24           360ns ± 2%     387ns ± 2%  +7.51%  (p=0.000 n=9+10)
Writer/sinkType=real/concurrentWriter=92-24           358ns ± 5%     388ns ± 3%  +8.27%  (p=0.000 n=10+10)
Writer/sinkType=real/concurrentWriter=128-24          356ns ± 3%     386ns ± 1%  +8.43%  (p=0.000 n=10+10)

name                                               old speed      new speed      delta
Writer/sinkType=real/concurrentWriter=1-24         67.7GB/s ± 1%  63.2GB/s ± 3%  -6.67%  (p=0.000 n=10+9)
Writer/sinkType=real/concurrentWriter=24-24        67.1GB/s ± 2%  62.5GB/s ± 3%  -6.88%  (p=0.000 n=10+9)
Writer/sinkType=real/concurrentWriter=48-24        67.6GB/s ± 1%  63.4GB/s ± 3%  -6.19%  (p=0.000 n=9+9)
Writer/sinkType=real/concurrentWriter=64-24        68.2GB/s ± 2%  63.4GB/s ± 2%  -6.98%  (p=0.000 n=9+10)
Writer/sinkType=real/concurrentWriter=92-24        68.6GB/s ± 5%  63.4GB/s ± 3%  -7.67%  (p=0.000 n=10+10)
Writer/sinkType=real/concurrentWriter=128-24       69.0GB/s ± 3%  63.7GB/s ± 1%  -7.79%  (p=0.000 n=10+10)

name                                               old alloc/op   new alloc/op   delta
Writer/sinkType=real/concurrentWriter=1-24            70.6B ± 2%     68.9B ± 3%  -2.36%  (p=0.009 n=9+9)
Writer/sinkType=real/concurrentWriter=24-24           73.7B ± 6%     69.4B ± 5%  -5.83%  (p=0.001 n=10+10)
Writer/sinkType=real/concurrentWriter=48-24           73.0B ± 4%     67.7B ± 6%  -7.26%  (p=0.000 n=10+10)
Writer/sinkType=real/concurrentWriter=64-24           69.8B ± 5%     69.4B ± 2%    ~     (p=0.975 n=10+8)
Writer/sinkType=real/concurrentWriter=92-24           69.8B ± 3%     70.2B ± 5%    ~     (p=0.576 n=9+10)
Writer/sinkType=real/concurrentWriter=128-24          69.1B ± 6%     70.8B ± 4%  +2.46%  (p=0.034 n=10+10)

name                                               old allocs/op  new allocs/op  delta
Writer/sinkType=real/concurrentWriter=1-24             0.00           0.00         ~     (all equal)
Writer/sinkType=real/concurrentWriter=24-24            0.00           0.00         ~     (all equal)
Writer/sinkType=real/concurrentWriter=48-24            0.00           0.00         ~     (all equal)
Writer/sinkType=real/concurrentWriter=64-24            0.00           0.00         ~     (all equal)
Writer/sinkType=real/concurrentWriter=92-24            0.00           0.00         ~     (all equal)
Writer/sinkType=real/concurrentWriter=128-24           0.00           0.00         ~     (all equal)

@Azhng Azhng force-pushed the txn-id-faster-cache branch from c5af5a4 to fd7287a Compare February 10, 2022 23:30
@Azhng
Contributor Author

Azhng commented Feb 10, 2022

Updated kv95 benchmark. Hmm, it seems to be a bit slower than the previous simple eviction-list implementation.

Branch elapsed errors ops(total) ops/sec(cum) avg(ms) p50(ms) p95(ms) p99(ms) pMax(ms)
baseline 2700.0s 0 55785736 20661.4 3.1 2.4 8.4 13.1 75.5
cache 2700.0s 0 49867068 18469.3 3.5 2.5 10.0 17.8 671.1
cache-noop-sink 2700.0s 0 56440227 20903.8 3.1 2.4 8.1 13.1 71.3
cache-regular-evict-list 2700.0s 0 60666291 22469.0 2.8 2.2 7.6 11.0 75.5
cache-compact-evict-list 2700.0s 0 54782029 20289.6 3.2 2.5 8.4 12.6 109.1

@Azhng
Contributor Author

Azhng commented Feb 10, 2022

Raw files output from both kv95 runs

kv95.compact-list.txt
kv95.normal-list.txt

blockSize++
}

if blockHijacked := c.mu.eviction.insertBlock(block, blockSize); !blockHijacked {
Contributor


some of this hijacking feels too clever to me. I feel like just embedding the messageBlock as a value in the eviction list and pooling that would be simpler from a data-abstraction perspective. You can do it as a single for loop over the block. I don't feel that strongly though.

Contributor Author


Done. Switched to using pool for node allocation.

}
}

type evictionNode struct {
Contributor


I think these things need to be pooled. I'd like to see the benchmark allocating on average zero or near zero bytes as opposed to the current ~70. I wouldn't focus so much on the runtime in terms of nanoseconds for this, if you get the allocations to zero, my money is that we won't see big impact on the kv workload.

Contributor Author


Done.

@Azhng Azhng force-pushed the txn-id-faster-cache branch from fd7287a to 273fcbe Compare February 11, 2022 00:38
Comment on lines +62 to +66
nodePool: &sync.Pool{
New: func() interface{} {
return &evictionNode{}
},
},
Contributor


sync.Pool objects should be global vars. There's no good reason to have more than one for a given type. Every instance gets registered in the runtime and updated on every gc cycle.
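
The pattern being recommended here, a single package-level sync.Pool shared by all instances rather than one pool per cache, looks roughly like this sketch (the evictionNode fields are illustrative, not the PR's actual struct):

```go
package main

import (
	"fmt"
	"sync"
)

type evictionNode struct {
	next *evictionNode
}

// One package-level pool for the type: every sync.Pool instance is
// registered with the runtime and visited on each GC cycle, so a shared
// global pool is cheaper than constructing a pool per cache instance.
var nodePool = sync.Pool{
	New: func() interface{} { return &evictionNode{} },
}

func getNode() *evictionNode {
	return nodePool.Get().(*evictionNode)
}

func putNode(n *evictionNode) {
	*n = evictionNode{} // zero before returning so stale pointers don't leak
	nodePool.Put(n)
}

func main() {
	n := getNode()
	putNode(n)
	fmt.Println(getNode() != nil) // true
}
```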

Contributor Author


Done. Also created #76418 to clean up the rest of the sync.Pool usage.

@Azhng Azhng force-pushed the txn-id-faster-cache branch 4 times, most recently from e3c74ed to 4103821 Compare February 11, 2022 03:02
This commit introduces the new BenchmarkWriter benchmark to measure the
write performance of the TxnID cache.

Informs cockroachdb#76013

Release note: None
@Azhng Azhng force-pushed the txn-id-faster-cache branch from 4103821 to 04610e9 Compare February 14, 2022 16:39
Contributor

@ajwerner ajwerner left a comment


Just cleanup nits from me. I didn't scrutinize the tests all that closely, but they seemed fine.

Modulo the nits (the naming one really is up to you), 👍 LGTM

}

// insertBlock inserts the messageBlock into the eviction list.
func (e *evictionList) insertBlock(block *messageBlock, blockSize int) {
Contributor


I think it might be more elegant if you just passed the block in here as a slice. It's an extra word on the stack, but who cares.

}

// Handle when the tail block is full.
if e.tail.isFull() {
Contributor


what if, instead of this method, you put a method on the evictionNode which was:

func (e *evictionList) append(msgs []ResolvedTxnID) {
    msgs = e.appendToTail(msgs)
    for len(msgs) > 0 {
        e.addNode()
        msgs = e.appendToTail(msgs)
    }
}

func (e *evictionList) addNode() {
    newNode := nodePool.Get().(*evictionNode)
    if e.head == nil {
        e.head = newNode
    } else {
        e.tail.next = newNode
    }
    e.tail = newNode
    e.tailBlockIdx = 0
}

func (e *evictionList) appendToTail(msgs []ResolvedTxnID) (remaining []ResolvedTxnID) {
    if e.head == nil {
        return msgs
    }
    toCopy := messageBlockSize - e.tailBlockIdx
    if toCopy > len(msgs) {
        toCopy = len(msgs)
    }
    copy(e.tail.messageBlock[e.tailBlockIdx:], msgs[:toCopy])
    e.tailBlockIdx += toCopy
    return msgs[toCopy:]
}

// fifoCache sometimes directly hijacks the input messageBlock to form the
// eviction list, and often even directly overwrites its content to minimize
// memory allocation.
func (c *fifoCache) Add(block *messageBlock) {
Copy link
Copy Markdown
Contributor


how come this is exported?

@@ -20,6 +20,10 @@ const messageBlockSize = 1024

type messageBlock [messageBlockSize]ResolvedTxnID
Contributor


If I were to be nit-picky, I think I'd just call this block. The fact that things are "messages" sort of muddles things for me. I think I'd go through and eliminate the word message/msg everywhere. messageSink would become blockSink, messageBlockSize -> blockSize, msgBlockPool -> blockPool. Then for the list, maybe blockListNode and blockList?

@@ -20,6 +20,10 @@ const messageBlockSize = 1024

Contributor


1024 feels pretty big to me. How do you feel about maybe 170? If each ResolvedTxnID is 24 bytes, that gives you as close as you can get with a 4k page and your nodes would be still under 4k too. I think at 170 you've already amortized a ton.


// tailBlockIdx is an index pointing into the next empty slot in messageBlock
// stored in the tail pointer.
tailBlockIdx int
Contributor


tailBlockIdx feels like one too many words. tailIdx?

Comment on lines +207 to +211
func forkBlock(b *messageBlock) *messageBlock {
newBlock := &messageBlock{}
copy(newBlock[:], b[:])
return newBlock
}
Contributor


nit: cloneBlock?

@Azhng Azhng force-pushed the txn-id-faster-cache branch 3 times, most recently from bcae826 to 35d2a40 Compare February 15, 2022 18:37
Contributor Author

@Azhng Azhng left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/sql/contention/txnidcache/concurrent_write_buffer.go, line 20 at r6 (raw file):

Previously, ajwerner wrote…

1024 feels pretty big to me. How do you feel about maybe 170? If each ResolvedTxnID is 24 bytes, that gives you as close as you can get with a 4k page and your nodes would be still under 4k too. I think at 170 you've already amortized a ton.

Set it to 168 so that some of the math works out evenly without rounding.
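
To check the arithmetic behind 168 (assuming each ResolvedTxnID is 24 bytes, per the comment quoted above): 168 × 24 = 4032 bytes, which keeps a full block within a 4 KiB page with 64 bytes to spare for the node's other fields.

```go
package main

import "fmt"

func main() {
	const entrySize = 24  // assumed size of one ResolvedTxnID, per the review
	const blockSize = 168 // entries per block, as chosen above
	total := blockSize * entrySize
	fmt.Println(total, total <= 4096) // 4032 true
}
```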


pkg/sql/contention/txnidcache/concurrent_write_buffer.go, line 21 at r6 (raw file):

Previously, ajwerner wrote…

If I were to be nit-picky, I think I'd just call this block. The fact that things are "messages" sort of muddles things for me. I think I'd go through and eliminate the word message/msg everywhere. messageSink would become blockSink, messageBlockSize -> blockSize, msgBlockPool -> blockPool. Then for the list, maybe blockListNode and blockList?

Done.


pkg/sql/contention/txnidcache/fifo_cache.go, line 55 at r6 (raw file):

Previously, ajwerner wrote…

tailBlockIdx feels like one too many words. tailIdx?

Done.


pkg/sql/contention/txnidcache/fifo_cache.go, line 75 at r6 (raw file):

Previously, ajwerner wrote…

how come this is exported?

Done. Unexported.


pkg/sql/contention/txnidcache/fifo_cache.go, line 139 at r6 (raw file):

Previously, ajwerner wrote…

I think it might be more elegant if you just passed the block in here as a slice. It's an extra word on the stack, but who cares.

Done.


pkg/sql/contention/txnidcache/fifo_cache.go, line 153 at r6 (raw file):

Previously, ajwerner wrote…

what if, instead of this method, you put a method on the evictionNode which was:

func (e *evictionList) append(msgs []ResolvedTxnID) {
    msgs = e.appendToTail(msgs)
    for len(msgs) > 0 {
        e.addNode()
        msgs = e.appendToTail(msgs)
    }
}

func (e *evictionList) addNode() {
    newNode := nodePool.Get().(*evictionNode)
    if e.head == nil {
        e.head = newNode
    } else {
        e.tail.next = newNode
    }
    e.tail = newNode
    e.tailBlockIdx = 0
}

func (e *evictionList) appendToTail(msgs []ResolvedTxnID) (remaining []ResolvedTxnID) {
    if e.head == nil {
        return msgs
    }
    toCopy := messageBlockSize - e.tailBlockIdx
    if toCopy > len(msgs) {
        toCopy = len(msgs)
    }
    copy(e.tail.messageBlock[e.tailBlockIdx:], msgs[:toCopy])
    e.tailBlockIdx += toCopy
    return msgs[toCopy:]
}

Done.


pkg/sql/contention/txnidcache/fifo_cache_test.go, line 211 at r6 (raw file):

Previously, ajwerner wrote…

nit: cloneBlock?

Done.

Contributor

@ajwerner ajwerner left a comment


LGTM

Previously, the txnID cache relied on cache.UnorderedCache for storage and
FIFO eviction behavior. However, due to unintended allocation behavior
within cache.UnorderedCache, this caused a roughly 20% throughput drop in
the kv95 benchmark.

This commit introduces a new FIFO cache built to remove all allocation
during write operations, eliminating the performance hit caused by the
previous use of cache.UnorderedCache.

Resolves cockroachdb#76013

Release note: None
@Azhng Azhng force-pushed the txn-id-faster-cache branch from 35d2a40 to 8c35183 Compare February 15, 2022 21:25
@Azhng
Contributor Author

Azhng commented Feb 15, 2022

TFTR!

bors r+

@craig
Contributor

craig bot commented Feb 16, 2022

Build succeeded:

@craig craig bot merged commit 9f98cde into cockroachdb:master Feb 16, 2022


Development

Successfully merging this pull request may close these issues.

roachperf: kv95 regression around February 1st
