
sql: introduce new FIFO cache for txnID cache #76350

Merged
craig[bot] merged 2 commits into cockroachdb:master from Azhng:txn-id-faster-cache
Feb 16, 2022

Conversation

@Azhng
Contributor

@Azhng Azhng commented Feb 10, 2022

Reviewer note: first commit belongs to #76244


Previously, the txnID cache relied on cache.UnorderedCache for storage and
FIFO eviction behavior. However, due to unintended allocation behavior
within cache.UnorderedCache, this caused a roughly 20% throughput drop in
the kv95 benchmark.

This commit introduces a new FIFO cache built to remove all allocation
during write operations, eliminating the performance hit caused by the
previous use of cache.UnorderedCache.

Resolves #76013

Release note: None
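
The design discussed in this PR can be sketched roughly as follows. This is an illustrative Go sketch, not the actual CockroachDB implementation: all names (resolvedTxnID, block, fifoCache) and the slice-backed FIFO are simplifications. The key idea is that writers fill fixed-size value-type blocks and the cache evicts whole blocks in oldest-first order, so filling a block itself performs no per-entry heap allocation.

```go
package main

import "fmt"

const blockSize = 4 // tiny for demonstration; the review below settles on 168

type resolvedTxnID struct {
	txnID       uint64
	fingerprint uint64
}

// block is a fixed-size array held by value, so filling it allocates nothing.
type block [blockSize]resolvedTxnID

// fifoCache keeps at most capacity entries, evicting the oldest block first.
type fifoCache struct {
	capacity int
	blocks   []block // FIFO order: blocks[0] is the oldest
	size     int
}

func (c *fifoCache) add(b block, n int) {
	c.blocks = append(c.blocks, b)
	c.size += n
	// Evict whole blocks from the front until we are back within capacity.
	for c.size > c.capacity && len(c.blocks) > 0 {
		c.size -= blockSize
		c.blocks = c.blocks[1:]
	}
}

func main() {
	c := &fifoCache{capacity: 8}
	var b block
	for i := 0; i < 4; i++ {
		for j := 0; j < blockSize; j++ {
			b[j] = resolvedTxnID{txnID: uint64(i*blockSize + j)}
		}
		c.add(b, blockSize)
	}
	fmt.Println(len(c.blocks), c.size) // two full blocks retained, 8 entries
}
```

The real PR additionally pools the list nodes (see the sync.Pool discussion below) so the write path stays allocation-free in steady state.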

@cockroach-teamcity
Member

This change is Reviewable

@Azhng
Contributor Author

Azhng commented Feb 10, 2022

kv95 benchmark:

Branch elapsed ops(total) ops/sec(cum) avg(ms) p50(ms) p95(ms) p99(ms) pMax(ms)
no-cache 2700.0s 55785736 20661.4 3.1 2.4 8.4 13.1 75.5
old-cache 2700.0s 49867068 18469.3 3.5 2.5 10.0 17.8 671.1
new-cache 2700.0s 60666291 22469.0 2.8 2.2 7.6 11.0 75.5

Microbenchmark:

$ benchstat ~/tmp/BenchmarkWriter-master/bench ~/tmp/BenchmarkWriter-typed-cache/bench | grep -v 'blackHole'                                                   [18:27:15]
name                                               old time/op    new time/op    delta
Writer/sinkType=real/concurrentWriter=1-24            787ns ± 7%     363ns ± 1%   -53.87%  (p=0.000 n=10+10)
Writer/sinkType=real/concurrentWriter=24-24           760ns ±15%     366ns ± 2%   -51.80%  (p=0.000 n=10+10)
Writer/sinkType=real/concurrentWriter=48-24           735ns ±12%     364ns ± 1%   -50.49%  (p=0.000 n=10+9)
Writer/sinkType=real/concurrentWriter=64-24           775ns ± 8%     360ns ± 2%   -53.51%  (p=0.000 n=10+9)
Writer/sinkType=real/concurrentWriter=92-24           790ns ± 4%     358ns ± 5%   -54.63%  (p=0.000 n=10+10)
Writer/sinkType=real/concurrentWriter=128-24          782ns ± 6%     356ns ± 3%   -54.47%  (p=0.000 n=10+10)

name                                               old speed      new speed      delta
Writer/sinkType=real/concurrentWriter=1-24         31.3GB/s ± 7%  67.7GB/s ± 1%  +116.40%  (p=0.000 n=10+10)
Writer/sinkType=real/concurrentWriter=24-24        32.6GB/s ±17%  67.1GB/s ± 2%  +105.99%  (p=0.000 n=10+10)
Writer/sinkType=real/concurrentWriter=48-24        33.6GB/s ±13%  67.6GB/s ± 1%  +101.01%  (p=0.000 n=10+9)
Writer/sinkType=real/concurrentWriter=64-24        31.8GB/s ± 7%  68.2GB/s ± 2%  +114.76%  (p=0.000 n=10+9)
Writer/sinkType=real/concurrentWriter=92-24        31.1GB/s ± 3%  68.6GB/s ± 5%  +120.46%  (p=0.000 n=10+10)
Writer/sinkType=real/concurrentWriter=128-24       31.5GB/s ± 6%  69.0GB/s ± 3%  +119.46%  (p=0.000 n=10+10)

name                                               old alloc/op   new alloc/op   delta
Writer/sinkType=real/concurrentWriter=1-24             239B ± 2%       71B ± 2%   -70.44%  (p=0.000 n=9+9)
Writer/sinkType=real/concurrentWriter=24-24            239B ± 2%       74B ± 6%   -69.18%  (p=0.000 n=8+10)
Writer/sinkType=real/concurrentWriter=48-24            242B ± 2%       73B ± 4%   -69.82%  (p=0.000 n=8+10)
Writer/sinkType=real/concurrentWriter=64-24            239B ± 2%       70B ± 5%   -70.82%  (p=0.000 n=10+10)
Writer/sinkType=real/concurrentWriter=92-24            238B ± 1%       70B ± 3%   -70.68%  (p=0.000 n=10+9)
Writer/sinkType=real/concurrentWriter=128-24           238B ± 2%       69B ± 6%   -70.95%  (p=0.000 n=10+10)

name                                               old allocs/op  new allocs/op  delta
Writer/sinkType=real/concurrentWriter=1-24             2.00 ± 0%      0.00       -100.00%  (p=0.000 n=10+10)
Writer/sinkType=real/concurrentWriter=24-24            2.00 ± 0%      0.00       -100.00%  (p=0.000 n=10+10)
Writer/sinkType=real/concurrentWriter=48-24            2.00 ± 0%      0.00       -100.00%  (p=0.000 n=10+10)
Writer/sinkType=real/concurrentWriter=64-24            2.00 ± 0%      0.00       -100.00%  (p=0.000 n=10+10)
Writer/sinkType=real/concurrentWriter=92-24            2.00 ± 0%      0.00       -100.00%  (p=0.000 n=10+10)
Writer/sinkType=real/concurrentWriter=128-24           2.00 ± 0%      0.00       -100.00%  (p=0.000 n=10+10)

@Azhng Azhng force-pushed the txn-id-faster-cache branch from 972dee5 to 30c10c9 Compare February 10, 2022 00:29
@Azhng Azhng marked this pull request as ready for review February 10, 2022 01:42
@Azhng Azhng requested review from a team and ajwerner February 10, 2022 02:00
Contributor

@ajwerner ajwerner left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @Azhng)


pkg/sql/contention/txnidcache/fifo_cache.go, line 68 at r2 (raw file):

	}

	c.mu.eviction.insertBlock(block)

I worry about cases where you flush due to a timer as opposed to because the blocks are full and then you end up with mostly empty blocks in the cache. Perhaps it'd be better to have evictionNode have its own embedded messageBlock (as opposed to *messageBlock) and copy out of the block in add and into the current tail, filling up blocks in the FIFO as you go. Then you know you'll only ever have at most one non-full block. Then, sync.Pool the evictionNode.

@Azhng Azhng force-pushed the txn-id-faster-cache branch from 30c10c9 to db5eb95 Compare February 10, 2022 22:40
Contributor Author

@Azhng Azhng left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/sql/contention/txnidcache/fifo_cache.go, line 68 at r2 (raw file):

Previously, ajwerner wrote…

I worry about cases where you flush due to a timer as opposed to because the blocks are full and then you end up with mostly empty blocks in the cache. Perhaps it'd be better to have evictionNode have its own embedded messageBlock (as opposed to *messageBlock) and copy out of the block in add and into the current tail, filling up blocks in the FIFO as you go. Then you know you'll only ever have at most one non-full block. Then, sync.Pool the evictionNode.

I went down the route of directly hijacking the blocks and using them in the eviction list (mostly because I misunderstood your comment when I first read it).

Thoughts on this approach?

@Azhng Azhng force-pushed the txn-id-faster-cache branch from db5eb95 to c5af5a4 Compare February 10, 2022 22:46
@Azhng
Contributor Author

Azhng commented Feb 10, 2022

Compared to the previous approach (that risked having many empty blocks) I think the new overhead here is about 7% in the microbenchmark. I will follow up with the kv95 result once it's ready.

azhng@crlMBP-C02F814DMD6TMTA1: ~/go/src/github.com/cockroachdb/cockroach txn-id-faster-cache ⚡
$ benchstat ~/tmp/BenchmarkWriter-typed-cache/bench ~/tmp/BenchmarkWriter-compact-evict-list/bench | grep -v 'blackHole'                                     [17:49:23]
name                                               old time/op    new time/op    delta
Writer/sinkType=real/concurrentWriter=1-24            363ns ± 1%     389ns ± 3%  +7.20%  (p=0.000 n=10+9)
Writer/sinkType=real/concurrentWriter=24-24           366ns ± 2%     393ns ± 3%  +7.39%  (p=0.000 n=10+9)
Writer/sinkType=real/concurrentWriter=48-24           364ns ± 1%     388ns ± 3%  +6.64%  (p=0.000 n=9+9)
Writer/sinkType=real/concurrentWriter=64-24           360ns ± 2%     387ns ± 2%  +7.51%  (p=0.000 n=9+10)
Writer/sinkType=real/concurrentWriter=92-24           358ns ± 5%     388ns ± 3%  +8.27%  (p=0.000 n=10+10)
Writer/sinkType=real/concurrentWriter=128-24          356ns ± 3%     386ns ± 1%  +8.43%  (p=0.000 n=10+10)

name                                               old speed      new speed      delta
Writer/sinkType=real/concurrentWriter=1-24         67.7GB/s ± 1%  63.2GB/s ± 3%  -6.67%  (p=0.000 n=10+9)
Writer/sinkType=real/concurrentWriter=24-24        67.1GB/s ± 2%  62.5GB/s ± 3%  -6.88%  (p=0.000 n=10+9)
Writer/sinkType=real/concurrentWriter=48-24        67.6GB/s ± 1%  63.4GB/s ± 3%  -6.19%  (p=0.000 n=9+9)
Writer/sinkType=real/concurrentWriter=64-24        68.2GB/s ± 2%  63.4GB/s ± 2%  -6.98%  (p=0.000 n=9+10)
Writer/sinkType=real/concurrentWriter=92-24        68.6GB/s ± 5%  63.4GB/s ± 3%  -7.67%  (p=0.000 n=10+10)
Writer/sinkType=real/concurrentWriter=128-24       69.0GB/s ± 3%  63.7GB/s ± 1%  -7.79%  (p=0.000 n=10+10)

name                                               old alloc/op   new alloc/op   delta
Writer/sinkType=real/concurrentWriter=1-24            70.6B ± 2%     68.9B ± 3%  -2.36%  (p=0.009 n=9+9)
Writer/sinkType=real/concurrentWriter=24-24           73.7B ± 6%     69.4B ± 5%  -5.83%  (p=0.001 n=10+10)
Writer/sinkType=real/concurrentWriter=48-24           73.0B ± 4%     67.7B ± 6%  -7.26%  (p=0.000 n=10+10)
Writer/sinkType=real/concurrentWriter=64-24           69.8B ± 5%     69.4B ± 2%    ~     (p=0.975 n=10+8)
Writer/sinkType=real/concurrentWriter=92-24           69.8B ± 3%     70.2B ± 5%    ~     (p=0.576 n=9+10)
Writer/sinkType=real/concurrentWriter=128-24          69.1B ± 6%     70.8B ± 4%  +2.46%  (p=0.034 n=10+10)

name                                               old allocs/op  new allocs/op  delta
Writer/sinkType=real/concurrentWriter=1-24             0.00           0.00         ~     (all equal)
Writer/sinkType=real/concurrentWriter=24-24            0.00           0.00         ~     (all equal)
Writer/sinkType=real/concurrentWriter=48-24            0.00           0.00         ~     (all equal)
Writer/sinkType=real/concurrentWriter=64-24            0.00           0.00         ~     (all equal)
Writer/sinkType=real/concurrentWriter=92-24            0.00           0.00         ~     (all equal)
Writer/sinkType=real/concurrentWriter=128-24           0.00           0.00         ~     (all equal)

@Azhng Azhng force-pushed the txn-id-faster-cache branch from c5af5a4 to fd7287a Compare February 10, 2022 23:30
@Azhng
Contributor Author

Azhng commented Feb 10, 2022

Updated kv95 benchmark. Hmm, it seems to be a bit slower than the previous simple eviction-list implementation.

Branch elapsed errors ops(total) ops/sec(cum) avg(ms) p50(ms) p95(ms) p99(ms) pMax(ms)
baseline 2700.0s 0 55785736 20661.4 3.1 2.4 8.4 13.1 75.5
cache 2700.0s 0 49867068 18469.3 3.5 2.5 10.0 17.8 671.1
cache-noop-sink 2700.0s 0 56440227 20903.8 3.1 2.4 8.1 13.1 71.3
cache-regular-evict-list 2700.0s 0 60666291 22469.0 2.8 2.2 7.6 11.0 75.5
cache-compact-evict-list 2700.0s 0 54782029 20289.6 3.2 2.5 8.4 12.6 109.1

@Azhng
Contributor Author

Azhng commented Feb 10, 2022

Raw files output from both kv95 runs

kv95.compact-list.txt
kv95.normal-list.txt

blockSize++
}

if blockHijacked := c.mu.eviction.insertBlock(block, blockSize); !blockHijacked {
Contributor


some of this hijacking feels too clever to me. I feel like just embedding the messageBlock as a value in the eviction list and pooling that would be simpler from a data-abstraction perspective. You can do it as a single for loop over the block. I don't feel that strongly though.

Contributor Author


Done. Switched to using pool for node allocation.

}
}

type evictionNode struct {
Contributor


I think these things need to be pooled. I'd like to see the benchmark allocating on average zero or near zero bytes as opposed to the current ~70. I wouldn't focus so much on the runtime in terms of nanoseconds for this, if you get the allocations to zero, my money is that we won't see big impact on the kv workload.

Contributor Author


Done.

@Azhng Azhng force-pushed the txn-id-faster-cache branch from fd7287a to 273fcbe Compare February 11, 2022 00:38
Comment on lines +62 to +66
nodePool: &sync.Pool{
New: func() interface{} {
return &evictionNode{}
},
},
Contributor


sync.Pool objects should be global vars. There's no good reason to have more than one for a given type. Every instance gets registered in the runtime and updated on every gc cycle.
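
The pattern being recommended here, a single package-level sync.Pool shared by all instances rather than one pool per cache, looks roughly like this sketch (the evictionNode fields are illustrative, not the PR's actual struct):

```go
package main

import (
	"fmt"
	"sync"
)

type evictionNode struct {
	next *evictionNode
}

// One package-level pool for the type: every sync.Pool instance is
// registered with the runtime and visited on each GC cycle, so a shared
// global pool is cheaper than constructing a pool per cache instance.
var nodePool = sync.Pool{
	New: func() interface{} { return &evictionNode{} },
}

func getNode() *evictionNode {
	return nodePool.Get().(*evictionNode)
}

func putNode(n *evictionNode) {
	*n = evictionNode{} // zero before returning so stale pointers don't leak
	nodePool.Put(n)
}

func main() {
	n := getNode()
	putNode(n)
	fmt.Println(getNode() != nil) // true
}
```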

Contributor Author


Done. Also created #76418 to clean up the rest of the sync.Pool usage.

@Azhng Azhng force-pushed the txn-id-faster-cache branch 4 times, most recently from e3c74ed to 4103821 Compare February 11, 2022 03:02
This commit introduces the new BenchmarkWriter benchmark to measure the
write performance of the TxnID cache.

Informs cockroachdb#76013

Release note: None
@Azhng Azhng force-pushed the txn-id-faster-cache branch from 4103821 to 04610e9 Compare February 14, 2022 16:39
Contributor

@ajwerner ajwerner left a comment


Just cleanup nits from me. I didn't scrutinize the tests all that closely, but they seemed fine.

Modulo the nits (the naming one really is up to you), 👍 LGTM

}

// insertBlock inserts the messageBlock into the eviction list.
func (e *evictionList) insertBlock(block *messageBlock, blockSize int) {
Contributor


I think it might be more elegant if you just passed the block in here as a slice. It's an extra word on the stack, but who cares.

}

// Handle when the tail block is full.
if e.tail.isFull() {
Contributor


what if, instead of this method, you put a method on the evictionNode which was:

func (e *evictionList) append(msgs []ResolvedTxnID) {
    msgs = e.appendToTail(msgs)
    for len(msgs) > 0 {
        e.addNode()
        msgs = e.appendToTail(msgs)
    }
}

func (e *evictionList) addNode() {
    newNode := nodePool.Get().(*evictionNode)
    if e.head == nil {
        e.head = newNode
    } else {
        e.tail.next = newNode
    }
    e.tail = newNode
    e.tailBlockIdx = 0
}

func (e *evictionList) appendToTail(msgs []ResolvedTxnID) (remaining []ResolvedTxnID) {
    if e.head == nil {
        return msgs
    }
    toCopy := messageBlockSize - e.tailBlockIdx
    if toCopy > len(msgs) {
        toCopy = len(msgs)
    }
    copy(e.tail.messageBlock[e.tailBlockIdx:], msgs[:toCopy])
    e.tailBlockIdx += toCopy
    return msgs[toCopy:]
}

// fifoCache sometimes directly hijacks the input messageBlock to form the
// eviction list, and often even directly overwrites its content to minimize
// memory allocation.
func (c *fifoCache) Add(block *messageBlock) {
Copy link
Copy Markdown
Contributor


how come this is exported?

@@ -20,6 +20,10 @@ const messageBlockSize = 1024

type messageBlock [messageBlockSize]ResolvedTxnID
Contributor


If I were to be nit-picky, I think I'd just call this block. The fact that things are "messages" sort of muddles things for me. I think I'd go through and eliminate the word message/msg everywhere. messageSink would become blockSink, messageBlockSize -> blockSize, msgBlockPool -> blockPool. Then for the list, maybe blockListNode and blockList?

@@ -20,6 +20,10 @@ const messageBlockSize = 1024

Contributor


1024 feels pretty big to me. How do you feel about maybe 170? If each ResolvedTxnID is 24 bytes, that gives you as close as you can get with a 4k page and your nodes would be still under 4k too. I think at 170 you've already amortized a ton.


// tailBlockIdx is an index pointing into the next empty slot in messageBlock
// stored in the tail pointer.
tailBlockIdx int
Contributor


tailBlockIdx feels like one too many words. tailIdx?

Comment on lines +207 to +211
func forkBlock(b *messageBlock) *messageBlock {
newBlock := &messageBlock{}
copy(newBlock[:], b[:])
return newBlock
}
Contributor


nit: cloneBlock?

@Azhng Azhng force-pushed the txn-id-faster-cache branch 3 times, most recently from bcae826 to 35d2a40 Compare February 15, 2022 18:37
Contributor Author

@Azhng Azhng left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/sql/contention/txnidcache/concurrent_write_buffer.go, line 20 at r6 (raw file):

Previously, ajwerner wrote…

1024 feels pretty big to me. How do you feel about maybe 170? If each ResolvedTxnID is 24 bytes, that gives you as close as you can get with a 4k page and your nodes would be still under 4k too. I think at 170 you've already amortized a ton.

Set it to 168 so that some of the math works out evenly without rounding.
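
To check the arithmetic behind 168 (assuming each ResolvedTxnID is 24 bytes, per the comment quoted above): 168 × 24 = 4032 bytes, which keeps a full block within a 4 KiB page with 64 bytes to spare for the node's other fields.

```go
package main

import "fmt"

func main() {
	const entrySize = 24  // assumed size of one ResolvedTxnID, per the review
	const blockSize = 168 // entries per block, as chosen above
	total := blockSize * entrySize
	fmt.Println(total, total <= 4096) // 4032 true
}
```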


pkg/sql/contention/txnidcache/concurrent_write_buffer.go, line 21 at r6 (raw file):

Previously, ajwerner wrote…

If I were to be nit-picky, I think I'd just call this block. The fact that things are "messages" sort of muddles things for me. I think I'd go through and eliminate the word message/msg everywhere. messageSink would become blockSink, messageBlockSize -> blockSize, msgBlockPool -> blockPool. Then for the list, maybe blockListNode and blockList?

Done.


pkg/sql/contention/txnidcache/fifo_cache.go, line 55 at r6 (raw file):

Previously, ajwerner wrote…

tailBlockIdx feels like one too many words. tailIdx?

Done.


pkg/sql/contention/txnidcache/fifo_cache.go, line 75 at r6 (raw file):

Previously, ajwerner wrote…

how come this is exported?

Done. Unexported.


pkg/sql/contention/txnidcache/fifo_cache.go, line 139 at r6 (raw file):

Previously, ajwerner wrote…

I think it might be more elegant if you just passed the block in here as a slice. It's an extra word on the stack, but who cares.

Done.


pkg/sql/contention/txnidcache/fifo_cache.go, line 153 at r6 (raw file):

Previously, ajwerner wrote…

what if, instead of this method, you put a method on the evictionNode which was:

func (e *evictionList) append(msgs []ResolvedTxnID) {
    msgs = e.appendToTail(msgs)
    for len(msgs) > 0 {
        e.addNode()
        msgs = e.appendToTail(msgs)
    }
}

func (e *evictionList) addNode() {
    newNode := nodePool.Get().(*evictionNode)
    if e.head == nil {
        e.head = newNode
    } else {
        e.tail.next = newNode
    }
    e.tail = newNode
    e.tailBlockIdx = 0
}

func (e *evictionList) appendToTail(msgs []ResolvedTxnID) (remaining []ResolvedTxnID) {
    if e.head == nil {
        return msgs
    }
    toCopy := messageBlockSize - e.tailBlockIdx
    if toCopy > len(msgs) {
        toCopy = len(msgs)
    }
    copy(e.tail.messageBlock[e.tailBlockIdx:], msgs[:toCopy])
    e.tailBlockIdx += toCopy
    return msgs[toCopy:]
}

Done.


pkg/sql/contention/txnidcache/fifo_cache_test.go, line 211 at r6 (raw file):

Previously, ajwerner wrote…

nit: cloneBlock?

Done.

Contributor

@ajwerner ajwerner left a comment


LGTM

Previously, the txnID cache relied on cache.UnorderedCache for storage and
FIFO eviction behavior. However, due to unintended allocation behavior
within cache.UnorderedCache, this caused a roughly 20% throughput drop in
the kv95 benchmark.

This commit introduces a new FIFO cache built to remove all allocation
during write operations, eliminating the performance hit caused by the
previous use of cache.UnorderedCache.

Resolves cockroachdb#76013

Release note: None
@Azhng Azhng force-pushed the txn-id-faster-cache branch from 35d2a40 to 8c35183 Compare February 15, 2022 21:25
@Azhng
Contributor Author

Azhng commented Feb 15, 2022

TFTR!

bors r+

@craig
Contributor

craig bot commented Feb 16, 2022

Build succeeded:

@craig craig bot merged commit 9f98cde into cockroachdb:master Feb 16, 2022


Development

Successfully merging this pull request may close these issues.

roachperf: kv95 regression around February 1st
