[do-not-merge] db: add NextPrefix optimization #1378
jbowens wants to merge 2 commits into cockroachdb:master
Conversation
This is a very crude proof of concept of an idea to mitigate the problem of MVCC garbage. The Iterator interface is expanded with a NextPrefix method. When a client no longer cares about the prefix currently positioned under the iterator, it may call NextPrefix to advance to the next distinct prefix. In this proof of concept, NextPrefix does not guarantee it will surface a new prefix, although we could if we buffered the current prefix.

Internally, NextPrefix behaves exactly like Next, except that the internal iterator positioned at the top of the merging iterator's heap is advanced with NextPrefix instead of Next. This gives that iterator an opportunity to use its own context to skip over keys.

This change adds one such optimization, gated behind a `PrefixSkipIndexing` flag. If set, Pebble configures an internally-managed block property collector. As an sstable is constructed, this block property collector watches for changes in key prefixes. For every block and index block, it records whether the key prefix changed within the block or index block's span. If the prefix did change, the collector elides any block property. If the prefix did not change, the collector adds a 1-byte property signaling that the block or index block does not contain any changes in prefix from the perspective of forward iteration.

When NextPrefix is called on an sstable Reader and the current block is exhausted, the Reader extracts the next block's property. If it exists, indicating the next block did not observe any prefix change during construction, the next block is irrelevant and may be skipped.

This optimization is tailored towards forward iteration. Providing the same optimization for backward iteration is trickier, because sstables are constructed in forward iteration and the prefix change indicators are generated from that forward iteration.
However, backward iteration may still skip a block if both its block property and the current exhausted block's property indicate that neither contains a prefix change. This isn't implemented in this proof of concept.

I ran a benchmark on my gceworker of a forward scan through a keyspace with varying numbers of duplicate keys. All keys had 10-byte values. When the number of versions of a key was 1, 10, or 100, this optimization had no benefit and added additional overhead. However, with 1,000 or 10,000 versions of a key there are significant latency savings:

```
name                          old time/op    new time/op    delta
NextPrefix/versions=1-24       142µs ± 5%     144µs ± 1%       ~     (p=0.151 n=5+5)
NextPrefix/versions=10-24     1.87ms ± 1%    1.89ms ± 1%       ~     (p=0.151 n=5+5)
NextPrefix/versions=100-24    12.6ms ± 3%    13.1ms ± 2%    +4.38%   (p=0.008 n=5+5)
NextPrefix/versions=1000-24    147ms ± 2%      26ms ± 2%   -82.45%   (p=0.008 n=5+5)
NextPrefix/versions=10000-24   1.48s ± 1%     0.05s ± 1%   -96.77%   (p=0.008 n=5+5)

name                          old alloc/op   new alloc/op   delta
NextPrefix/versions=1-24       10.0B ± 0%     10.0B ± 0%       ~     (all equal)
NextPrefix/versions=10-24      38.0B ± 0%     39.0B ± 0%    +2.63%   (p=0.016 n=4+5)
NextPrefix/versions=100-24      141B ±41%      172B ± 9%   +21.64%   (p=0.032 n=5+5)
NextPrefix/versions=1000-24   3.95kB ±16%    0.69kB ± 5%   -82.43%   (p=0.008 n=5+5)
NextPrefix/versions=10000-24   169kB ± 2%       9kB ± 6%   -94.87%   (p=0.008 n=5+5)

name                          old allocs/op  new allocs/op  delta
NextPrefix/versions=1-24        1.00 ± 0%      1.00 ± 0%       ~     (all equal)
NextPrefix/versions=10-24       1.00 ± 0%      1.00 ± 0%       ~     (all equal)
NextPrefix/versions=100-24      1.00 ± 0%      2.00 ± 0%  +100.00%   (p=0.016 n=4+5)
NextPrefix/versions=1000-24     34.8 ±18%       6.0 ± 0%   -82.76%   (p=0.016 n=5+4)
NextPrefix/versions=10000-24   1.74k ± 2%     0.10k ± 5%   -94.00%   (p=0.008 n=5+5)
```

I suspect the overhead for lower version counts can be reduced by only attempting this optimization if we've already iterated over the same prefix several times.
Also, there's an additional optimization with NextPrefix that we can make that will mostly benefit cases with a small number of versions. The NextPrefix InternalIterator interface can be extended to propagate the length of the current prefix, which the block iterator's NextPrefix can use to skip over keys so long as the number of bytes shared through the block's prefix compression exceeds the current prefix length.
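As a rough illustration of the intended client-facing pattern, here is a toy sketch. The `iter` type and its keys are hypothetical stand-ins, not Pebble's Iterator: the point is only that a scan interested in the newest version of each key calls NextPrefix instead of Next once it has consumed the version it wants.

```go
package main

import "fmt"

// kv is a toy versioned key: a user prefix plus a version number.
type kv struct {
	prefix  string
	version int
}

// iter is a stand-in iterator over keys sorted by prefix, newest version
// first. It is NOT Pebble's Iterator; it only demonstrates the pattern.
type iter struct {
	keys []kv
	i    int
}

func (it *iter) Valid() bool { return it.i < len(it.keys) }
func (it *iter) Key() kv     { return it.keys[it.i] }
func (it *iter) Next()       { it.i++ }

// NextPrefix advances past every remaining version of the current prefix.
func (it *iter) NextPrefix() {
	p := it.keys[it.i].prefix
	for it.i < len(it.keys) && it.keys[it.i].prefix == p {
		it.i++
	}
}

func main() {
	it := &iter{keys: []kv{
		{"a", 3}, {"a", 2}, {"a", 1},
		{"b", 7},
		{"c", 2}, {"c", 1},
	}}
	// Visit only the newest version of each prefix.
	for it.Valid() {
		fmt.Println(it.Key().prefix, it.Key().version) // a 3, b 7, c 2
		it.NextPrefix()
	}
}
```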
Nice! Regarding the following: with ~20 bytes for each key-value pair and 100 versions, we have ~2,000 bytes, so the block property optimization would not help with 32KB blocks. Are all these gains coming from avoiding the key prefix comparison?
jbowens
left a comment
> Are all these gains coming from avoiding the key prefix comparison?
Yeah, that's right. This microbenchmark uses an in-memory FS, so I expect in a real workload the benefit would be less pronounced, but it is encouraging.
Can you pull out the key prefix comparison optimization into a separate PR? I think the …
Work-in-progress productionizing of the prefix-compression optimization is happening in #1387. I'm going to close this out as it's not intended to be merged, and we can consult the closed pull request if we ever consider productionizing the block-property collector optimization.
This is a very crude proof of concept of an idea to mitigate the problem of
MVCC garbage. The Iterator interface is expanded with a NextPrefix method.
When a client no longer cares about the prefix currently positioned under
the iterator, it may call NextPrefix to advance to the next distinct prefix.
In this proof of concept, NextPrefix does not guarantee it will surface a new
prefix, although we could if we buffer the current prefix.
Internally, NextPrefix behaves exactly like a Next, except that the internal
iterator positioned at the top of the merging iterator's heap is advanced with
NextPrefix instead of Next. This gives that iterator an opportunity to use its
own context to skip over keys.
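A minimal sketch of that delegation, using a toy merging iterator (not Pebble's mergingIter, and with a linear scan standing in for the heap). Only the top child is advanced with NextPrefix; the other children are untouched, so a same-prefix key from another child can still surface afterwards, which is why NextPrefix makes no new-prefix guarantee:

```go
package main

import "fmt"

// childIter is a minimal internal iterator with both Next and NextPrefix.
type childIter struct {
	keys []string
	i    int
}

func (c *childIter) Valid() bool { return c.i < len(c.keys) }
func (c *childIter) Key() string { return c.keys[c.i] }
func (c *childIter) Next()       { c.i++ }

// NextPrefix lets the child use its own context to skip the current prefix
// (here a linear scan; in an sstable it could consult block properties).
func (c *childIter) NextPrefix(prefixLen int) {
	p := c.Key()[:prefixLen]
	for c.Valid() && len(c.Key()) >= prefixLen && c.Key()[:prefixLen] == p {
		c.i++
	}
}

// mergingIter merges children; top() returns the child with the smallest
// key (a real implementation maintains a heap instead of rescanning).
type mergingIter struct {
	children []*childIter
}

func (m *mergingIter) top() *childIter {
	var best *childIter
	for _, c := range m.children {
		if c.Valid() && (best == nil || c.Key() < best.Key()) {
			best = c
		}
	}
	return best
}

func (m *mergingIter) Key() (string, bool) {
	if t := m.top(); t != nil {
		return t.Key(), true
	}
	return "", false
}

// NextPrefix advances only the top child with its own NextPrefix.
func (m *mergingIter) NextPrefix(prefixLen int) {
	if t := m.top(); t != nil {
		t.NextPrefix(prefixLen)
	}
}

func main() {
	m := &mergingIter{children: []*childIter{
		{keys: []string{"a@1", "a@2", "c@1"}},
		{keys: []string{"a@3", "b@1"}},
	}}
	k, _ := m.Key()
	fmt.Println(k) // a@1
	m.NextPrefix(1)
	k, _ = m.Key()
	fmt.Println(k) // a@3 — same prefix, surfaced by the other child
}
```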
This change adds one such optimization, gated behind a
`PrefixSkipIndexing` flag. If set, Pebble configures an internally-managed
block property collector. As an sstable is constructed, this block property
collector watches for changes in key prefixes. For every block and index
block, it records whether the key prefix changed within the block or index
block's span. If the prefix did change, the collector elides any block
property. If the prefix did not change, the collector adds a 1-byte property
signaling that the block or index block does not contain any changes in
prefix from the perspective of forward iteration.
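The collector's bookkeeping might look like the following sketch. The `prefixChangeCollector` type and the '@'-delimited prefix extraction are assumptions for illustration; they do not match Pebble's BlockPropertyCollector API:

```go
package main

import (
	"bytes"
	"fmt"
)

// prefix returns the key's prefix, assuming (hypothetically) that the
// version suffix is separated from the prefix by '@'.
func prefix(key []byte) []byte {
	if i := bytes.IndexByte(key, '@'); i >= 0 {
		return key[:i]
	}
	return key
}

// prefixChangeCollector watches keys added to the current block and
// records whether the prefix changed within the block's span.
type prefixChangeCollector struct {
	lastPrefix []byte
	changed    bool
}

func (c *prefixChangeCollector) Add(key []byte) {
	p := prefix(key)
	if c.lastPrefix != nil && !bytes.Equal(c.lastPrefix, p) {
		c.changed = true
	}
	c.lastPrefix = append(c.lastPrefix[:0], p...)
}

// FinishBlock returns the block's property: nil if the prefix changed
// (the property is elided), or a 1-byte marker if it did not. lastPrefix
// is deliberately carried across blocks, so a block whose first key merely
// continues the previous block's prefix still counts as "no change" from
// the perspective of forward iteration.
func (c *prefixChangeCollector) FinishBlock() []byte {
	changed := c.changed
	c.changed = false
	if changed {
		return nil
	}
	return []byte{1}
}

func main() {
	var c prefixChangeCollector
	for _, k := range []string{"a@3", "a@2", "a@1"} {
		c.Add([]byte(k))
	}
	fmt.Printf("%v\n", c.FinishBlock()) // [1] — single prefix in block
	for _, k := range []string{"a@0", "b@9"} {
		c.Add([]byte(k))
	}
	fmt.Printf("%v\n", c.FinishBlock()) // [] — prefix changed, property elided
}
```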
When NextPrefix is called on an sstable Reader and the current block is
exhausted, the Reader extracts the next block's property. If it exists,
indicating the next block did not observe any prefix change during
construction, the next block is irrelevant and may be skipped.
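As a toy model of that Reader-side decision (the types here are hypothetical, not sstable.Reader): once the current block is exhausted, skip forward past every subsequent block whose property says it contains no prefix change.

```go
package main

import "fmt"

// block is a toy stand-in for a data block plus the property written by
// the collector: noPrefixChange means the block's span contains no change
// in key prefix.
type block struct {
	keys           []string
	noPrefixChange bool
}

// nextBlockForNextPrefix returns the index of the first block after cur
// that may contain a new prefix; blocks whose property marks them as
// prefix-change-free are irrelevant to NextPrefix and are skipped.
func nextBlockForNextPrefix(blocks []block, cur int) int {
	i := cur + 1
	for i < len(blocks) && blocks[i].noPrefixChange {
		i++
	}
	return i
}

func main() {
	blocks := []block{
		{keys: []string{"a@9", "a@8"}, noPrefixChange: false},
		{keys: []string{"a@7", "a@6"}, noPrefixChange: true},
		{keys: []string{"a@5", "a@4"}, noPrefixChange: true},
		{keys: []string{"a@3", "b@9"}, noPrefixChange: false},
	}
	// Block 0 is exhausted during NextPrefix; blocks 1 and 2 are skipped.
	fmt.Println(nextBlockForNextPrefix(blocks, 0)) // 3
}
```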
This optimization is tailored towards forward iteration. Providing the same
optimization for backward iteration is trickier, because sstables are
constructed in forward iteration and the prefix change indicators are
generated from that forward iteration. However, backward iteration may still
skip a block if both its block property and the current exhausted block's
property indicate that neither contains a prefix change. This isn't
implemented in this proof of concept.
I also made an additional optimization with NextPrefix that helps avoid key
comparisons for versions of the same key within a block. The NextPrefix
InternalIterator interface propagates the length of the current prefix, which
the block iterator's NextPrefix can use to skip over keys so long as the
number of bytes shared through the block's prefix compression exceeds the
current prefix length.
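A sketch of that idea over prefix-compressed block entries. The `entry` layout is a simplification of the real restart-point encoding: while an entry's shared-byte count exceeds the current prefix length, its prefix cannot have changed, so the entry can be skipped without materializing or comparing keys.

```go
package main

import "fmt"

// entry is a block entry in prefix-compressed form: the number of bytes
// shared with the previous key, plus the unshared suffix.
type entry struct {
	shared int
	suffix string
}

// skipSamePrefix returns the index of the first entry after cur that might
// start a new prefix. An entry sharing more than prefixLen bytes with its
// predecessor necessarily repeats the predecessor's prefix.
func skipSamePrefix(entries []entry, cur, prefixLen int) int {
	i := cur + 1
	for i < len(entries) && entries[i].shared > prefixLen {
		i++
	}
	return i
}

func main() {
	// Keys: a@9, a@8, a@7, b@9 — prefix "a" has length 1, and the
	// versioned entries share the 2-byte "a@" with their predecessor.
	entries := []entry{
		{shared: 0, suffix: "a@9"},
		{shared: 2, suffix: "8"},
		{shared: 2, suffix: "7"},
		{shared: 0, suffix: "b@9"},
	}
	// From entry 0, NextPrefix can jump straight to entry 3 (b@9).
	fmt.Println(skipSamePrefix(entries, 0, 1)) // 3
}
```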
I ran a benchmark on my gceworker of a forward scan through a keyspace with
varying numbers of duplicate keys. All keys had 10-byte values.
I suspect the overhead for one version per key can be reduced by only
attempting this optimization if we've already seen the same prefix multiple
times.