[do-not-merge] db: add NextPrefix optimization #1378
jbowens wants to merge 2 commits into cockroachdb:master
Conversation
This is a very crude proof of concept of an idea to mitigate the problem of MVCC garbage. The Iterator interface is expanded with a NextPrefix method. When a client no longer cares about the prefix currently positioned under the iterator, it may call NextPrefix to advance to the next distinct prefix. In this proof of concept, NextPrefix does not guarantee it will surface a new prefix, although we could if we buffered the current prefix.

Internally, NextPrefix behaves exactly like Next, except that the internal iterator positioned at the top of the merging iterator's heap is advanced with NextPrefix instead of Next. This gives that iterator an opportunity to use its own context to skip over keys.

This change adds one such optimization, gated behind a `PrefixSkipIndexing` flag. If set, Pebble configures an internally-managed block property collector. As an sstable is constructed, this block property collector watches for changes in key prefixes. For every block and index block, it records whether the key prefix changed within the block or index block's span. If the prefix did change, the collector elides any block property. If the prefix did not change, the collector adds a 1-byte property signaling that the block or index block does not contain any changes in prefix from the perspective of forward iteration.

When NextPrefix is called on an sstable Reader and the current block is exhausted, the Reader extracts the next block's property. If it exists, indicating the next block did not observe any prefix change during construction, the next block is irrelevant and may be skipped.

This optimization is tailored towards forward iteration. Providing the same optimization for backward iteration is trickier, because sstables are constructed in forward iteration and the prefix change indicators are generated from that forward iteration.
However, backward iteration may still skip a block if both its block property and the current exhausted block's property indicate that neither contains a prefix change. This isn't implemented in this proof of concept.

I ran a benchmark on my gceworker of a forward scan through a keyspace with varying numbers of duplicate keys. All keys had 10-byte values. When the number of versions of a key was 1, 10, or 100, this optimization had no benefit and added additional overhead. However, with 1,000 or 10,000 versions of a key there are significant latency savings:

```
name                          old time/op    new time/op    delta
NextPrefix/versions=1-24       142µs ± 5%     144µs ± 1%       ~     (p=0.151 n=5+5)
NextPrefix/versions=10-24     1.87ms ± 1%    1.89ms ± 1%       ~     (p=0.151 n=5+5)
NextPrefix/versions=100-24    12.6ms ± 3%    13.1ms ± 2%    +4.38%   (p=0.008 n=5+5)
NextPrefix/versions=1000-24    147ms ± 2%      26ms ± 2%   -82.45%   (p=0.008 n=5+5)
NextPrefix/versions=10000-24   1.48s ± 1%     0.05s ± 1%   -96.77%   (p=0.008 n=5+5)

name                          old alloc/op   new alloc/op   delta
NextPrefix/versions=1-24       10.0B ± 0%     10.0B ± 0%       ~     (all equal)
NextPrefix/versions=10-24      38.0B ± 0%     39.0B ± 0%    +2.63%   (p=0.016 n=4+5)
NextPrefix/versions=100-24      141B ±41%      172B ± 9%   +21.64%   (p=0.032 n=5+5)
NextPrefix/versions=1000-24   3.95kB ±16%    0.69kB ± 5%   -82.43%   (p=0.008 n=5+5)
NextPrefix/versions=10000-24   169kB ± 2%       9kB ± 6%   -94.87%   (p=0.008 n=5+5)

name                          old allocs/op  new allocs/op  delta
NextPrefix/versions=1-24        1.00 ± 0%      1.00 ± 0%       ~     (all equal)
NextPrefix/versions=10-24       1.00 ± 0%      1.00 ± 0%       ~     (all equal)
NextPrefix/versions=100-24      1.00 ± 0%      2.00 ± 0%  +100.00%   (p=0.016 n=4+5)
NextPrefix/versions=1000-24     34.8 ±18%       6.0 ± 0%   -82.76%   (p=0.016 n=5+4)
NextPrefix/versions=10000-24   1.74k ± 2%     0.10k ± 5%   -94.00%   (p=0.008 n=5+5)
```

I suspect the overhead for lower version counts can be reduced by only attempting this optimization if we've already iterated over the same prefix several times.
Also, there's an additional optimization with NextPrefix that we can make that will mostly benefit cases with a small number of versions. The NextPrefix InternalIterator interface can be extended to propagate the length of the current prefix, which the block iterator's NextPrefix can use to skip over keys so long as the number of bytes shared through the block's prefix compression exceeds the current prefix length.
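As a rough illustration of the intended client-facing pattern, here is a toy sketch. The `iter` type and its keys are hypothetical stand-ins, not Pebble's Iterator: the point is only that a scan interested in the newest version of each key calls NextPrefix instead of Next once it has consumed the version it wants.

```go
package main

import "fmt"

// kv is a toy versioned key: a user prefix plus a version number.
type kv struct {
	prefix  string
	version int
}

// iter is a stand-in iterator over keys sorted by prefix, newest version
// first. It is NOT Pebble's Iterator; it only demonstrates the pattern.
type iter struct {
	keys []kv
	i    int
}

func (it *iter) Valid() bool { return it.i < len(it.keys) }
func (it *iter) Key() kv     { return it.keys[it.i] }
func (it *iter) Next()       { it.i++ }

// NextPrefix advances past every remaining version of the current prefix.
func (it *iter) NextPrefix() {
	p := it.keys[it.i].prefix
	for it.i < len(it.keys) && it.keys[it.i].prefix == p {
		it.i++
	}
}

func main() {
	it := &iter{keys: []kv{
		{"a", 3}, {"a", 2}, {"a", 1},
		{"b", 7},
		{"c", 2}, {"c", 1},
	}}
	// Visit only the newest version of each prefix.
	for it.Valid() {
		fmt.Println(it.Key().prefix, it.Key().version) // a 3, b 7, c 2
		it.NextPrefix()
	}
}
```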
Nice! Regarding the following: with ~20 bytes for each key-value pair and 100 versions, we have ~2,000 bytes, so the block property optimization would not help with 32KB blocks. Are all these gains coming from avoiding the key prefix comparison?
jbowens
left a comment
> Are all these gains coming from avoiding the key prefix comparison?
Yeah, that's right. This microbenchmark uses an in-memory FS, so I expect in a real workload the benefit would be less pronounced, but it is encouraging.
Can you pull out the key prefix comparison optimization into a separate PR? I think the …
Work-in-progress productionizing of the prefix-compression optimization is happening in #1387. I'm going to close this out as it's not intended to be merged, and we can consult the closed pull request if we ever consider productionizing the block-property collector optimization.
This is a very crude proof of concept of an idea to mitigate the problem of
MVCC garbage. The Iterator interface is expanded with a NextPrefix method.
When a client no longer cares about the prefix currently positioned under
the iterator, it may call NextPrefix to advance to the next distinct prefix.
In this proof of concept, NextPrefix does not guarantee it will surface a new
prefix, although we could if we buffer the current prefix.
Internally, NextPrefix behaves exactly like a Next, except that the internal
iterator positioned at the top of the merging iterator's heap is advanced with
NextPrefix instead of Next. This gives that iterator an opportunity to use its
own context to skip over keys.
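A minimal sketch of that delegation, using a toy merging iterator (not Pebble's mergingIter, and with a linear scan standing in for the heap). Only the top child is advanced with NextPrefix; the other children are untouched, so a same-prefix key from another child can still surface afterwards, which is why NextPrefix makes no new-prefix guarantee:

```go
package main

import "fmt"

// childIter is a minimal internal iterator with both Next and NextPrefix.
type childIter struct {
	keys []string
	i    int
}

func (c *childIter) Valid() bool { return c.i < len(c.keys) }
func (c *childIter) Key() string { return c.keys[c.i] }
func (c *childIter) Next()       { c.i++ }

// NextPrefix lets the child use its own context to skip the current prefix
// (here a linear scan; in an sstable it could consult block properties).
func (c *childIter) NextPrefix(prefixLen int) {
	p := c.Key()[:prefixLen]
	for c.Valid() && len(c.Key()) >= prefixLen && c.Key()[:prefixLen] == p {
		c.i++
	}
}

// mergingIter merges children; top() returns the child with the smallest
// key (a real implementation maintains a heap instead of rescanning).
type mergingIter struct {
	children []*childIter
}

func (m *mergingIter) top() *childIter {
	var best *childIter
	for _, c := range m.children {
		if c.Valid() && (best == nil || c.Key() < best.Key()) {
			best = c
		}
	}
	return best
}

func (m *mergingIter) Key() (string, bool) {
	if t := m.top(); t != nil {
		return t.Key(), true
	}
	return "", false
}

// NextPrefix advances only the top child with its own NextPrefix.
func (m *mergingIter) NextPrefix(prefixLen int) {
	if t := m.top(); t != nil {
		t.NextPrefix(prefixLen)
	}
}

func main() {
	m := &mergingIter{children: []*childIter{
		{keys: []string{"a@1", "a@2", "c@1"}},
		{keys: []string{"a@3", "b@1"}},
	}}
	k, _ := m.Key()
	fmt.Println(k) // a@1
	m.NextPrefix(1)
	k, _ = m.Key()
	fmt.Println(k) // a@3 — same prefix, surfaced by the other child
}
```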
This change adds one such optimization, gated behind a
`PrefixSkipIndexing` flag. If set, Pebble configures an internally-managed
block property collector. As an sstable is constructed, this block property
collector watches for changes in key prefixes. For every block and index
block, it records whether the key prefix changed within the block or index
block's span. If the prefix did change, the collector elides any block
property. If the prefix did not change, the collector adds a 1-byte property
signaling that the block or index block does not contain any changes in
prefix from the perspective of forward iteration.
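The collector's bookkeeping might look like the following sketch. The `prefixChangeCollector` type and the '@'-delimited prefix extraction are assumptions for illustration; they do not match Pebble's BlockPropertyCollector API:

```go
package main

import (
	"bytes"
	"fmt"
)

// prefix returns the key's prefix, assuming (hypothetically) that the
// version suffix is separated from the prefix by '@'.
func prefix(key []byte) []byte {
	if i := bytes.IndexByte(key, '@'); i >= 0 {
		return key[:i]
	}
	return key
}

// prefixChangeCollector watches keys added to the current block and
// records whether the prefix changed within the block's span.
type prefixChangeCollector struct {
	lastPrefix []byte
	changed    bool
}

func (c *prefixChangeCollector) Add(key []byte) {
	p := prefix(key)
	if c.lastPrefix != nil && !bytes.Equal(c.lastPrefix, p) {
		c.changed = true
	}
	c.lastPrefix = append(c.lastPrefix[:0], p...)
}

// FinishBlock returns the block's property: nil if the prefix changed
// (the property is elided), or a 1-byte marker if it did not. lastPrefix
// is deliberately carried across blocks, so a block whose first key merely
// continues the previous block's prefix still counts as "no change" from
// the perspective of forward iteration.
func (c *prefixChangeCollector) FinishBlock() []byte {
	changed := c.changed
	c.changed = false
	if changed {
		return nil
	}
	return []byte{1}
}

func main() {
	var c prefixChangeCollector
	for _, k := range []string{"a@3", "a@2", "a@1"} {
		c.Add([]byte(k))
	}
	fmt.Printf("%v\n", c.FinishBlock()) // [1] — single prefix in block
	for _, k := range []string{"a@0", "b@9"} {
		c.Add([]byte(k))
	}
	fmt.Printf("%v\n", c.FinishBlock()) // [] — prefix changed, property elided
}
```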
When NextPrefix is called on an sstable Reader and the current block is
exhausted, the Reader extracts the next block's property. If it exists,
indicating the next block did not observe any prefix change during
construction, the next block is irrelevant and may be skipped.
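As a toy model of that Reader-side decision (the types here are hypothetical, not sstable.Reader): once the current block is exhausted, skip forward past every subsequent block whose property says it contains no prefix change.

```go
package main

import "fmt"

// block is a toy stand-in for a data block plus the property written by
// the collector: noPrefixChange means the block's span contains no change
// in key prefix.
type block struct {
	keys           []string
	noPrefixChange bool
}

// nextBlockForNextPrefix returns the index of the first block after cur
// that may contain a new prefix; blocks whose property marks them as
// prefix-change-free are irrelevant to NextPrefix and are skipped.
func nextBlockForNextPrefix(blocks []block, cur int) int {
	i := cur + 1
	for i < len(blocks) && blocks[i].noPrefixChange {
		i++
	}
	return i
}

func main() {
	blocks := []block{
		{keys: []string{"a@9", "a@8"}, noPrefixChange: false},
		{keys: []string{"a@7", "a@6"}, noPrefixChange: true},
		{keys: []string{"a@5", "a@4"}, noPrefixChange: true},
		{keys: []string{"a@3", "b@9"}, noPrefixChange: false},
	}
	// Block 0 is exhausted during NextPrefix; blocks 1 and 2 are skipped.
	fmt.Println(nextBlockForNextPrefix(blocks, 0)) // 3
}
```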
This optimization is tailored towards forward iteration. Providing the same
optimization for backward iteration is trickier, because sstables are
constructed in forward iteration and the prefix change indicators are
generated from that forward iteration. However, backward iteration may still
skip a block if both its block property and the current exhausted block's
property indicate that neither contains a prefix change. This isn't
implemented in this proof of concept.
I also made an additional optimization with NextPrefix that helps avoid key
comparisons for versions of the same key within a block. The NextPrefix
InternalIterator interface propagates the length of the current prefix, which
the block iterator's NextPrefix can use to skip over keys so long as the
number of bytes shared through the block's prefix compression exceeds the
current prefix length.
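A sketch of that idea over prefix-compressed block entries. The `entry` layout is a simplification of the real restart-point encoding: while an entry's shared-byte count exceeds the current prefix length, its prefix cannot have changed, so the entry can be skipped without materializing or comparing keys.

```go
package main

import "fmt"

// entry is a block entry in prefix-compressed form: the number of bytes
// shared with the previous key, plus the unshared suffix.
type entry struct {
	shared int
	suffix string
}

// skipSamePrefix returns the index of the first entry after cur that might
// start a new prefix. An entry sharing more than prefixLen bytes with its
// predecessor necessarily repeats the predecessor's prefix.
func skipSamePrefix(entries []entry, cur, prefixLen int) int {
	i := cur + 1
	for i < len(entries) && entries[i].shared > prefixLen {
		i++
	}
	return i
}

func main() {
	// Keys: a@9, a@8, a@7, b@9 — prefix "a" has length 1, and the
	// versioned entries share the 2-byte "a@" with their predecessor.
	entries := []entry{
		{shared: 0, suffix: "a@9"},
		{shared: 2, suffix: "8"},
		{shared: 2, suffix: "7"},
		{shared: 0, suffix: "b@9"},
	}
	// From entry 0, NextPrefix can jump straight to entry 3 (b@9).
	fmt.Println(skipSamePrefix(entries, 0, 1)) // 3
}
```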
I ran a benchmark on my gceworker of a forward scan through a keyspace with
varying numbers of duplicate keys. All keys had 10-byte values.
I suspect the overhead for one version per key can be reduced by only
attempting this optimization if we've already seen the same prefix multiple
times.