Skip to content

sstable: add parallelized RewriteKeySuffixes function#1377

Merged
dt merged 1 commit intocockroachdb:masterfrom
dt:sst-rewrite
Jan 31, 2022
Merged

sstable: add parallelized RewriteKeySuffixes function#1377
dt merged 1 commit intocockroachdb:masterfrom
dt:sst-rewrite

Conversation

@dt
Copy link
Copy Markdown
Contributor

@dt dt commented Nov 18, 2021

This adds a function to the sstable package that can replace some suffix
with another suffix from every key in an sstable, assuming enforcing
that the sstable consists of only Sets.

If the replacement is passed a filter policy, rather than build new
filters, the filter blocks from the original sstable are copied over,
but only after checking that it used the same filter policy and the same
splitter, and that that splitter splits the same length that as is being
replaced i.e. that the pre-replacement filter remains valid
post-replacement.

Data blocks are rewritten in parallel. This informs a few choices in how
it does so: rewritten keys are not passed to addPoint, but rather are
directly handed to a worker-local blockWriter. Finished blocks are not
immediately flushed to the output or added to the index, but rather are
added to a buffer, which is then iterated and flushed at the end.

A significant implication of not passing each key to addPoint is that
individual KVs are not passed to the table and block property collectors
during rewriting. Instead, when flushing the finished blocks, each table
collector is passed just the first key of the table and each block prop
collector is passed the first key of each block. For the collectors that
CockroachDB uses, to which only the timestamp key suffix is relevant,
this approach may be sufficient, as during replacement all keys in all
blocks have the same suffix. However extending the collector API could
allow this to be more general, e.g. either by making a separate API for
this approach, like AddOneKeyForBlock, or by making the collectors able
to collect and aggregate partial results, so each rewriter could have a
local partial collector and then merge their results. However as the
simple approach of just passing one key per block to the existing API
appears sufficient for CockroachDB, this is left for later exploration.

Benchmarking this implementation compared to simply opening a writer and
feeding each key from a reader to it shows modest gains even without the
addition of concurrency, just from skipping rebuilding filters and props
from every key, with a 12M SST rewriting at about 33% faster. With the
addition of concurrency=8, this jumps to about a 7.5x speedup over the
naive read/write iteration, of 5x over concurrency=1.

BenchmarkRewriter
    writer_test.go:357: rewriting a 12 M SSTable
BenchmarkRewriter/RewriteKeySuffixes,concurrency=1
BenchmarkRewriter/RewriteKeySuffixes,concurrency=1-10                130          91251942 ns/op
BenchmarkRewriter/RewriteKeySuffixes,concurrency=8
BenchmarkRewriter/RewriteKeySuffixes,concurrency=8-10                666          18205387 ns/op
BenchmarkRewriter/read-and-write-new-sst
BenchmarkRewriter/read-and-write-new-sst-10                           88         134926796 ns/op

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Contributor

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

We do anticipate computing block property collectors over data beyond just the key's timestamp. We've been considering explorations around pushing down SQL filters and projections into Pebble, and using block property collectors to aggregate statistics over keys within a block.

Concurrency while reading/decompressing and writing/compressing is desirable more generally (eg, #599), and I'm hopeful that we can combine these use cases into one facility that offers parallelism. I suspect that we can get the performance we need while doing a sequential iteration that computes block property collectors sequentially but defers everything else to parallel workers.

Reviewable status: 0 of 7 files reviewed, all discussions resolved (waiting on @erikgrinaker and @sumeerbhola)

@dt dt force-pushed the sst-rewrite branch 5 times, most recently from 279a02b to 2b3cd36 Compare November 18, 2021 20:21
@dt
Copy link
Copy Markdown
Contributor Author

dt commented Nov 18, 2021

@jbowens I'm not too worried about future table and block prop collectors that require seeing every kv. I think for table prop collectors we could extend their API to have a AddAll(other PropCollector) function s.t. each worker could have a local collector them merge them at the end. For block collectors, I think we could probably have a local instance of the collector be passed every kv and collect props for a block, then encode its result and store it along side the block itself in the buffer waiting to be flushed, with some tweak to the API to reduce the assumption that a single instance sees all blocks in order for the collector used in the local worker and then potentially having some sum op over all the encoded ones at the end ala the table collectors. Finally, we might also measure the cost of collecting those props during ingestion and determine is simply isn't worth it -- if it is even 10% overhead, that might be enough for us to say that getting RESTORE done sooner is more important since that is our disaster recovery time. Basically, I think we have plenty of options, so I'm not too worried about figuring out that future when it comes, and for the current collectors, the simple thing works fine.

Copy link
Copy Markdown
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Data blocks are rewritten in parallel.

Do you have a cpu profile of this benchmark. I am curious what are the big consumers of cpu in the part that has been parallelized.

Reviewable status: 0 of 7 files reviewed, 1 unresolved discussion (waiting on @dt and @erikgrinaker)


sstable/suffix_rewriter.go, line 92 at r5 (raw file):

	}

	for i := worker; i < len(input); i += totalWorkers {

there is an assumption here that the work per block is not heavily skewed, which is why this static distribution that gives each worker an equal number of blocks is acceptable. Worth noting in a code comment.

@dt
Copy link
Copy Markdown
Contributor Author

dt commented Nov 18, 2021

what are the big consumers of cpu in the part that has been parallelized.

I can post an actual profile later, but when I profiled it earlier it looked like the biggest was compression and decompression, with this implementation spending about 21% in compression vs 12% in loop reader/writer and 9% vs 8% in decompression. blockWriter.store was pretty large too IIRC.

I looked briefly at allocs and tried buffering the Clone allocs which cut overall allocs by >30%, but that yielded no significant top line perf savings.

@dt
Copy link
Copy Markdown
Contributor Author

dt commented Nov 19, 2021

@jbowens Another thing that occurred to me last night: ideally we'd copy over most higher level logical props like that blindly, instead of recomputing them, since the whole point is to minimize cost.

Maybe very significant here is the (enforced) stipulation that suffix replacement only modifies the suffix after Split()? We can add something to the prop collector API that says "hey, I'm copying this block but with this new suffix... what's the new prop?" and it could just be identity func for any props that are just concerned with the sql keys / values, to which the suffix is irrelevant.

Now I imagine some prop collectors might be interested in the suffix, like say, mvcc stats, but there too, I think one of the other stipulations of suffix replacement is extremely relevant: all the keys in the file had the same pre-replacement suffix, and are being changed to the same post-replacement suffix, so replacement does not change the relative suffix of any two keys in the file. That makes me optimistic that you still don't need any key-by-key re-computation, but rather some constant update to prop.

Anyway, this is all speculation until we actually implement any props other than the timestamps, but I'm bullish that given the very narrow definition of suffix replacement (all sets, all with the same suffix, all updates to the same new suffix) that we could make most possible prop collectors work without key-by-key recomputaiton, or if we had to, figure out how aggregate partial parallel computation.

@dt
Copy link
Copy Markdown
Contributor Author

dt commented Nov 23, 2021

So I tried adding a addProps call during the loop to flush produced blocks , which just decompresses it and iterates over it to pass all keys to the prop collectors sequentially. The impact is pretty striking:

name                                                         old time/op    new time/op    delta
RewriteSST/keys=1000000/RewriteKeySuffixes,concurrency=8-10    15.0ms ± 3%    39.3ms ± 1%  +161.52%  (p=0.000 n=9+10)

name                                                         old speed      new speed      delta
RewriteSST/keys=1000000/RewriteKeySuffixes,concurrency=8-10   804MB/s ± 3%   307MB/s ± 1%   -61.77%  (p=0.000 n=9+10)

name                                                         old alloc/op   new alloc/op   delta
RewriteSST/keys=1000000/RewriteKeySuffixes,concurrency=8-10    18.5MB ± 0%    18.5MB ± 0%    +0.07%  (p=0.000 n=8+9)

name                                                         old allocs/op  new allocs/op  delta
RewriteSST/keys=1000000/RewriteKeySuffixes,concurrency=8-10     24.6k ± 0%     24.6k ± 0%    +0.02%  (p=0.000 n=7+9)

Looking at profile for a run after, it's now just over 20% overall in the addProps with that about evenly split between snappy decompression and blockIter.Next.

That looks like enough of a cost to motivate keeping parallel "computation" of block props for now, perhaps with some formalization in the API, and then weighing other options if/when another form of prop requires something different?

@jbowens
Copy link
Copy Markdown
Contributor

jbowens commented Nov 23, 2021

@dt Yeah, good point. It does seem viable in most use cases to copy over properties that are not dependent on the suffix. Only block properties that depend on the combination of key-value pair and the timestamp would run into difficulties. The block property collector in #1378 is an example of a collector that needs to be computed serially, although practically speaking that particular collector wouldn't ever be used for AddSSTable-rewritten sstables that do not contain multiple MVCC versions of the same key.

Given that we're also implementing #599 right now, I think we should also explore leveraging that work for this use case to see how far it gets us. If we can avoid it, it'd be nice to avoid maintaining two separate sstable concurrency facilities. It might be the case that the concurrency facilities from #599 plus some of the other optimizations you added here (eg, copying the filter block verbatim) can get us where we need to be.

@dt
Copy link
Copy Markdown
Contributor Author

dt commented Nov 23, 2021

I doubt just #599 will be sufficient on its own: compression is certainly a big driver of the total time, but only about than a third overall; we're still spending about 20% (now that I removed a bunch of cache.Cache mutex contention) in blockWriter.store for example, mostly in counting out bytes for sharedprefixlen, and getting that parallelized 8x is still helping us out here. As speed-of-light test I added a compression=none case to the benchmark, just to see what would happen if somehow #599 were able to magically completely eliminate compression:

name                                                                        time/op
RewriteSST/NoCompression/keys=1000000/ReaderWriterLoop-10                      132ms ± 0%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=1-10     79.5ms ± 0%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=2-10     42.3ms ± 1%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=4-10     23.4ms ± 1%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=8-10     15.6ms ± 3%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=16-10    15.7ms ± 1%
RewriteSST/Snappy/keys=1000000/ReaderWriterLoop-10                             141ms ± 0%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=1-10            84.9ms ± 1%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=2-10            44.8ms ± 1%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=4-10            24.5ms ± 1%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=8-10            16.4ms ± 1%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=16-10           16.9ms ± 6%

name                                                                        speed
RewriteSST/NoCompression/keys=1000000/ReaderWriterLoop-10                    361MB/s ± 0%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=1-10    600MB/s ± 0%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=2-10   1.13GB/s ± 1%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=4-10   2.04GB/s ± 1%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=8-10   3.05GB/s ± 3%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=16-10  3.05GB/s ± 1%
RewriteSST/Snappy/keys=1000000/ReaderWriterLoop-10                          85.8MB/s ± 0%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=1-10           142MB/s ± 1%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=2-10           269MB/s ± 1%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=4-10           493MB/s ± 1%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=8-10           738MB/s ± 1%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=16-10          716MB/s ± 5%

@dt dt force-pushed the sst-rewrite branch 2 times, most recently from 452f6f6 to 4fb427d Compare November 23, 2021 19:40
@dt dt changed the title sstable: add RewriteKeySuffixes function sstable: add parallelized RewriteKeySuffixes function Nov 23, 2021
@dt dt force-pushed the sst-rewrite branch 4 times, most recently from 84889c4 to 71ffaa0 Compare November 26, 2021 01:03
@dt
Copy link
Copy Markdown
Contributor Author

dt commented Nov 26, 2021

I played with the prop collectors ideas discussed above a little, and I think I found something I'm not wholly opposed to. Code is over on https://github.com/dt/pebble/compare/sst-rewrite...sst-rewrite-prop-collectors?expand=1

TL;DR: an optional extension to the collector interface that a collector can choose to implement if it wants to be usable during replacement, which is one new method that is called for each block with the old prop value for that block and the change in suffix. Technically none of cockroach's current props actually need the old value, since they can derive their state from just from the suffix, so I suppose it is a little bit of wasted extra work to read, alloc and plumb it to them, but I think the increased generality of the API justifies it, and I'm pretty sure that API would support a collector like the one in #1378 (since just updating the suffix can't change whether or not the prefix change is different, so it could elect to just copy its prior value).

@dt
Copy link
Copy Markdown
Contributor Author

dt commented Jan 18, 2022

@jbowens Do you think it'd be easier to review and land this as-is then follow up with block property handling, or pull in the block property handling now and land it all at once?

@dt dt requested a review from jbowens January 21, 2022 03:19
Copy link
Copy Markdown
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drive-by comments -- feel free to ignore

Reviewable status: 0 of 14 files reviewed, 6 unresolved discussions (waiting on @dt, @erikgrinaker, and @jbowens)


sstable/suffix_rewriter.go, line 20 at r19 (raw file):

// required that the SST consist of only SETs, and that every key have `from` as
// its suffix, and furthermore the length of from must match the suffix length
// designated by the Split() function if one is set in the WriterOptions.

is the exception if one is set in the WriterOptions no longer relevant since we are enforcing that suffix is the one returned by Split?


sstable/suffix_rewriter.go, line 37 at r19 (raw file):

//
// 1) to be able to copy unmodified filters, it is required that the writer's
// comparer have a Split function and that the replaced suffix of each key not

this would not need to be repeated here, if this is a general requirement.


sstable/suffix_rewriter.go, line 43 at r19 (raw file):

// 2) any block and table property collectors in-use must be able to derive a
// correct result -- for a block wherein *all keys had and will have the same
// suffix* -- from a single updated key per block.

how important is this optimization in terms of performance numbers? I ask since unlike the other restrictions specified here, this one (subjectively) feels like a hack.

Copy link
Copy Markdown
Contributor Author

@dt dt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That speaks to the ongoing maintenance burden of having an additional code path intertwined with low-level sstable writing logic.

Eh, I mean, sure, but I think it's cheaper to update it all at once than it is to go back during a rebase and figure out where "w.restartInterval" went. As far as figuring out the common shared APIs rather than intertwining, I imagine the answer is to merge and then start to see where to draw the lines e.g. would rewriter and writer be happy to share a common indexWriter helper struct, instead of rewriter having a Writer which it uses just for flushing indexes?

Reviewable status: 0 of 14 files reviewed, 6 unresolved discussions (waiting on @dt, @erikgrinaker, @jbowens, and @sumeerbhola)


sstable/suffix_rewriter.go, line 20 at r19 (raw file):

Previously, sumeerbhola wrote…

is the exception if one is set in the WriterOptions no longer relevant since we are enforcing that suffix is the one returned by Split?

Yeah, correct, we're requiring WriterOptions.Comparer.Split be set to call this at all. Will update.


sstable/suffix_rewriter.go, line 37 at r19 (raw file):

Previously, sumeerbhola wrote…

this would not need to be repeated here, if this is a general requirement.

Yeah, this is out of date. Will fix.


sstable/suffix_rewriter.go, line 43 at r19 (raw file):

Previously, sumeerbhola wrote…

how important is this optimization in terms of performance numbers? I ask since unlike the other restrictions specified here, this one (subjectively) feels like a hack.

I opined about this hack (and agree completely that that is what it is) above, and linked to a change that I have queued up to follow this that I think improves on it significantly with an extension to the prop collector API: dt@4207068

@dt dt force-pushed the sst-rewrite branch 2 times, most recently from dffac95 to 9e21d12 Compare January 24, 2022 20:51
Copy link
Copy Markdown
Contributor Author

@dt dt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 14 files reviewed, 6 unresolved discussions (waiting on @erikgrinaker, @jbowens, and @sumeerbhola)


sstable/suffix_rewriter.go, line 92 at r5 (raw file):

Previously, sumeerbhola wrote…

there is an assumption here that the work per block is not heavily skewed, which is why this static distribution that gives each worker an equal number of blocks is acceptable. Worth noting in a code comment.

Done.

Copy link
Copy Markdown
Contributor

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 4 files at r18, 1 of 1 files at r21.
Reviewable status: 3 of 14 files reviewed, 13 unresolved discussions (waiting on @dt, @erikgrinaker, @jbowens, and @sumeerbhola)


sstable/suffix_rewriter.go, line 20 at r21 (raw file):

-- the an sstable to out containing the sstable in

looks like some of this comment got mangled


sstable/suffix_rewriter.go, line 26 at r21 (raw file):

// simply passed each key from the Reader, and its result is returned. In this
// implementation, all filters and properties are re-derived, just as if the new
// SST had been constructed from arbitrary external keys.

Could we just use the one, concurrent implementation with concurrency=1 if concurrency is set to zero? It seems preferable to just have the one implementation.


sstable/suffix_rewriter.go, line 141 at r21 (raw file):

	}

	// Copy over the filter block if it exists (if it does, copyDataBlocks will

nit: s/copyDataBlocks/rewriteDataBlocksToWriter/ ?


sstable/suffix_rewriter.go, line 181 at r21 (raw file):

	errCh chan error,
	g *sync.WaitGroup,
) {

nit: this is a pretty thin function with many parameters. inline in an anonymous function in the call site?


sstable/suffix_rewriter.go, line 201 at r21 (raw file):

		restartInterval: restartInterval,
	}
	var compressionBuf []byte

can this use the new dataBlockBuf?


sstable/suffix_rewriter.go, line 249 at r21 (raw file):

				return err
			}
			prefixLen := len(key.UserKey) - len(from)

nit: can use si instead of computing from len(key.UserKey) - len(from)


sstable/suffix_rewriter.go, line 305 at r21 (raw file):

		default:
			return errors.Newf("unsupported checksum type: %d", checksumType)
		}

this can be a followup, but this can now use checksummer.checksum.


sstable/suffix_rewriter.go, line 349 at r21 (raw file):

	for i := range blocks {
		// Write the rewritten block to the file.
		n, err := w.writer.Write(blocks[i].data)

This can be a followup, if at all, but this can be done in parallel to individual block rewriting with additional synchronization, right? Do you know how much time is spent in the above block rewriting versus this loop, copying into the memfile and constructing the index?

Copy link
Copy Markdown
Contributor Author

@dt dt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 3 of 14 files reviewed, 11 unresolved discussions (waiting on @dt, @erikgrinaker, @jbowens, and @sumeerbhola)


sstable/suffix_rewriter.go, line 26 at r21 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Could we just use the one, concurrent implementation with concurrency=1 if concurrency is set to zero? It seems preferable to just have the one implementation.

the 0 vs 1 implementation was mostly for ease of testing. I could just make two separate functions though and then let the caller call the right one. In that case, I think concurrency < 1 is an error.


sstable/suffix_rewriter.go, line 141 at r21 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: s/copyDataBlocks/rewriteDataBlocksToWriter/ ?

Ah, yep, thanks.


sstable/suffix_rewriter.go, line 181 at r21 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: this is a pretty thin function with many parameters. inline in an anonymous function in the call site?

Done.


sstable/suffix_rewriter.go, line 201 at r21 (raw file):

Previously, jbowens (Jackson Owens) wrote…

can this use the new dataBlockBuf?

Switched to blockBuf.


sstable/suffix_rewriter.go, line 249 at r21 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: can use si instead of computing from len(key.UserKey) - len(from)

Done.


sstable/suffix_rewriter.go, line 305 at r21 (raw file):

Previously, jbowens (Jackson Owens) wrote…

this can be a followup, but this can now use checksummer.checksum.

compressAndChecksum, right? this removes a bunch of the repeated code here, so that seems like a win. Unfortunately seems to force moving blockBuf to a pointer so I pick up another alloc but I'll switch it for now and then golf the perf a bit later if it is needed.


sstable/suffix_rewriter.go, line 349 at r21 (raw file):

Previously, jbowens (Jackson Owens) wrote…

This can be a followup, if at all, but this can be done in parallel to individual block rewriting with additional synchronization, right? Do you know how much time is spent in the above block rewriting versus this loop, copying into the memfile and constructing the index?

yeah, we could stat another flusher worker and then have some additional synchronization like atomic's for low watermark of each worker to determine what's flushable or something, I just didn't want to do it now.

Profile wise, don't think these memcopies really showed up last I checked -- IIRC it was mostly moving the bytes around during rewriting.

Copy link
Copy Markdown
Contributor Author

@dt dt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 14 files reviewed, 11 unresolved discussions (waiting on @erikgrinaker, @jbowens, and @sumeerbhola)


sstable/suffix_rewriter.go, line 20 at r21 (raw file):

Previously, jbowens (Jackson Owens) wrote…

-- the an sstable to out containing the sstable in

looks like some of this comment got mangled

Reworked it entirely, to remove the whole this-impl-that-impl thing.

This adds a function to the sstable package that can replace some suffix
with another suffix from every key in an sstable, assuming enforcing
that the sstable consists of only Sets.

If the replacement is passed a filter policy, rather than build new
filters, the filter blocks from the original sstable are copied over,
but only after checking that it used the same filter policy and the same
splitter, and that that splitter splits the same length that as is being
replaced i.e. that the pre-replacement filter remains valid
post-replacement.

Data blocks are rewritten in parallel. This informs a few choices in how
it does so: rewritten keys are not passed to `addPoint`, but rather are
directly handed to a worker-local blockWriter. Finished blocks are not
immediately flushed to the output or added to the index, but rather are
added to a buffer, which is then iterated and flushed at the end.

A significant implication of not passing each key to `addPoint` is that
individual KVs are not passed to the table and block property collectors
during rewriting. Instead, when flushing the finished blocks, each table
collector is passed just the first key of the table and each block prop
collector is passed the first key of each block. For the collectors that
CockroachDB uses, to which only the timestamp key suffix is relevant,
this approach may be sufficient, as during replacement all keys in all
blocks have the same suffix. However extending the collector API could
allow this to be more general, e.g. either by making a separate API for
this approach, like AddOneKeyForBlock, or by making the collectors able
to collect and aggregate partial results, so each rewriter could have a
local partial collector and then merge their results. However as the
simple approach of just passing one key per block to the existing API
appears sufficient for CockroachDB, this is left for later exploration.

Benchmarking this implementation compared to simply opening a writer and
feeding each key from a reader to it shows modest gains even without the
addition of concurrency, just from skipping rebuilding filters and props
from every key, with a 12M SST rewriting at about 33% faster. With the
addition of concurrency=8, this jumps to about a 7.5x speedup over the
naive read/write iteration, of 5x over concurrency=1.

```
name                                                                        time/op
RewriteSST/NoCompression/keys=100/ReaderWriterLoop-10                         29.6µs ± 4%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=1-10         45.9µs ±13%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=2-10         54.9µs ± 8%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=4-10         54.9µs ± 9%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=8-10         56.8µs ± 8%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=16-10        60.7µs ± 7%
RewriteSST/NoCompression/keys=10000/ReaderWriterLoop-10                       1.29ms ± 1%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=1-10        877µs ± 2%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=2-10        516µs ± 0%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=4-10        369µs ± 0%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=8-10        364µs ± 2%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=16-10       417µs ± 2%
RewriteSST/NoCompression/keys=1000000/ReaderWriterLoop-10                      125ms ± 2%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=1-10     82.6ms ± 1%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=2-10     44.4ms ± 1%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=4-10     24.6ms ± 1%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=8-10     15.0ms ± 1%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=16-10    16.0ms ± 1%
RewriteSST/Snappy/keys=100/ReaderWriterLoop-10                                32.9µs ± 4%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=1-10                49.2µs ± 4%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=2-10                60.6µs ± 9%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=4-10                56.9µs ± 6%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=8-10                57.8µs ± 9%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=16-10               61.7µs ± 6%
RewriteSST/Snappy/keys=10000/ReaderWriterLoop-10                              1.39ms ± 2%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=1-10               958µs ± 0%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=2-10               547µs ± 2%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=4-10               382µs ± 1%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=8-10               368µs ± 1%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=16-10              424µs ± 2%
RewriteSST/Snappy/keys=1000000/ReaderWriterLoop-10                             134ms ± 2%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=1-10            90.2ms ± 1%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=2-10            48.5ms ± 1%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=4-10            27.1ms ± 2%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=8-10            16.6ms ± 2%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=16-10           17.3ms ± 1%

name                                                                        speed
RewriteSST/NoCompression/keys=100/ReaderWriterLoop-10                        195MB/s ± 4%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=1-10        126MB/s ±12%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=2-10        105MB/s ± 8%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=4-10        105MB/s ± 8%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=8-10        102MB/s ± 7%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=16-10      95.2MB/s ± 7%
RewriteSST/NoCompression/keys=10000/ReaderWriterLoop-10                      371MB/s ± 1%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=1-10      545MB/s ± 2%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=2-10      927MB/s ± 0%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=4-10     1.29GB/s ± 1%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=8-10     1.31GB/s ± 2%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=16-10    1.15GB/s ± 2%
RewriteSST/NoCompression/keys=1000000/ReaderWriterLoop-10                    381MB/s ± 2%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=1-10    578MB/s ± 1%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=2-10   1.08GB/s ± 1%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=4-10   1.94GB/s ± 1%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=8-10   3.18GB/s ± 1%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=16-10  2.99GB/s ± 1%
RewriteSST/Snappy/keys=100/ReaderWriterLoop-10                              67.9MB/s ± 4%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=1-10              45.4MB/s ± 4%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=2-10              36.9MB/s ± 8%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=4-10              38.9MB/s ± 9%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=8-10              38.7MB/s ± 9%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=16-10             36.2MB/s ± 6%
RewriteSST/Snappy/keys=10000/ReaderWriterLoop-10                            87.5MB/s ± 2%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=1-10             127MB/s ± 0%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=2-10             223MB/s ± 2%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=4-10             319MB/s ± 1%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=8-10             331MB/s ± 1%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=16-10            288MB/s ± 2%
RewriteSST/Snappy/keys=1000000/ReaderWriterLoop-10                          90.3MB/s ± 2%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=1-10           134MB/s ± 1%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=2-10           249MB/s ± 1%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=4-10           446MB/s ± 2%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=8-10           726MB/s ± 2%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=16-10          697MB/s ± 1%

name                                                                        alloc/op
RewriteSST/NoCompression/keys=100/ReaderWriterLoop-10                          302kB ± 0%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=1-10         1.03MB ± 0%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=2-10         1.38MB ± 0%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=4-10         1.38MB ± 0%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=8-10         1.38MB ± 0%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=16-10        1.38MB ± 0%
RewriteSST/NoCompression/keys=10000/ReaderWriterLoop-10                        551kB ± 0%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=1-10       1.13MB ± 0%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=2-10       1.87MB ± 0%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=4-10       3.36MB ± 0%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=8-10       6.32MB ± 0%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=16-10      12.2MB ± 0%
RewriteSST/NoCompression/keys=1000000/ReaderWriterLoop-10                     25.5MB ± 0%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=1-10     14.0MB ± 0%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=2-10     14.4MB ± 0%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=4-10     15.0MB ± 0%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=8-10     16.1MB ± 0%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=16-10    22.0MB ± 0%
RewriteSST/Snappy/keys=100/ReaderWriterLoop-10                                 302kB ± 0%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=1-10                1.03MB ± 0%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=2-10                1.38MB ± 0%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=4-10                1.38MB ± 0%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=8-10                1.38MB ± 0%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=16-10               1.38MB ± 0%
RewriteSST/Snappy/keys=10000/ReaderWriterLoop-10                               551kB ± 0%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=1-10              1.13MB ± 0%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=2-10              1.87MB ± 0%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=4-10              3.36MB ± 0%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=8-10              6.32MB ± 0%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=16-10             12.2MB ± 0%
RewriteSST/Snappy/keys=1000000/ReaderWriterLoop-10                            25.5MB ± 0%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=1-10            14.0MB ± 0%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=2-10            14.4MB ± 0%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=4-10            15.0MB ± 0%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=8-10            16.1MB ± 0%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=16-10           22.0MB ± 0%

name                                                                        allocs/op
RewriteSST/NoCompression/keys=100/ReaderWriterLoop-10                           92.0 ± 0%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=1-10           93.0 ± 0%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=2-10            103 ± 0%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=4-10            103 ± 0%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=8-10            103 ± 0%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=16-10           103 ± 0%
RewriteSST/NoCompression/keys=10000/ReaderWriterLoop-10                          122 ± 0%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=1-10          226 ± 0%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=2-10          235 ± 0%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=4-10          255 ± 0%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=8-10          294 ± 0%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=16-10         368 ± 0%
RewriteSST/NoCompression/keys=1000000/ReaderWriterLoop-10                        168 ± 0%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=1-10      11.7k ± 0%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=2-10      11.7k ± 0%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=4-10      11.7k ± 0%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=8-10      11.7k ± 0%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=16-10     11.8k ± 0%
RewriteSST/Snappy/keys=100/ReaderWriterLoop-10                                  92.0 ± 0%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=1-10                  93.0 ± 0%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=2-10                   103 ± 0%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=4-10                   103 ± 0%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=8-10                   103 ± 0%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=16-10                  103 ± 0%
RewriteSST/Snappy/keys=10000/ReaderWriterLoop-10                                 122 ± 0%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=1-10                 226 ± 0%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=2-10                 235 ± 0%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=4-10                 256 ± 0%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=8-10                 294 ± 0%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=16-10                367 ± 0%
RewriteSST/Snappy/keys=1000000/ReaderWriterLoop-10                               168 ± 0%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=1-10             11.7k ± 0%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=2-10             11.7k ± 0%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=4-10             11.7k ± 0%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=8-10             11.7k ± 0%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=16-10            11.8k ± 0%
```
@dt
Copy link
Copy Markdown
Contributor Author

dt commented Jan 27, 2022

@sumeerbhola are you okay with merging this now with the hacky handling of prop collectors and then immediately opening a followup PR specifically for handling them better such as what I linked above? or would you rather pull that change in here first and merge as one when we hash it out?

@erikgrinaker
Copy link
Copy Markdown
Contributor

Can we drop the from parameter? The current KV API doesn't require input SSTs to have uniform timestamps, and it seems like an arbitrary requirement here.

@dt
Copy link
Copy Markdown
Contributor Author

dt commented Jan 27, 2022

I think having a from parameter will be helpful in the API for updating block and table properties -- if I start thinking about hypothetical property collectors like mvcc stats, knowing the change being applied to the suffix seems to open up more cases where you could update your property than just knowing the new suffix. The KV API also will effectively always get the from parameter to avoid rewrites entirely outside pushes so it only being able to use the fast rewriter when we have it isn't a big restriction for us?

@erikgrinaker
Copy link
Copy Markdown
Contributor

The KV API also will effectively always get the from parameter to avoid rewrites entirely outside pushes so it only being able to use the fast rewriter when we have it isn't a big restriction for us?

Possibly, unless there are cases where we want to ingest pre-built SSTs that we don't want to bother with rewriting before submitting?

If we want to make this a hard requirement I'm going to have to shuffle some stuff around, but nbd.

@dt
Copy link
Copy Markdown
Contributor Author

dt commented Jan 27, 2022

I think we can keep things as they are for now and only use the accelerated rewriter if SSTTimestamp is set since in practice is always will be, and then revisit if/when anybody starts calling AddSSTable with WriteAtRequestTimestamp but not SSTTimestamp

@erikgrinaker
Copy link
Copy Markdown
Contributor

Yeah, let's see when we integrate this into KV. Easier to change the API around now before we have any callers, so if we think it's always going to be set we may as well design around that.

@dt
Copy link
Copy Markdown
Contributor Author

dt commented Jan 28, 2022

I'm happy to revisit the from matching requirement once we sort out the block and table props API as well, if we decide it knowing the delta isn't actually required there.

@dt
Copy link
Copy Markdown
Contributor Author

dt commented Jan 28, 2022

Is there anything left to resolve on this one before moving on to the follow-ups for property collectors?

@sumeerbhola
Copy link
Copy Markdown
Collaborator

@sumeerbhola are you okay with merging this now with the hacky handling of prop collectors and then immediately opening a followup PR specifically for handling them better such as what I linked above? or would you rather pull that change in here first and merge as one when we hash it out?

hacky handling is fine for this merge.

@dt
Copy link
Copy Markdown
Contributor Author

dt commented Jan 31, 2022

Thanks for the reviews!

@dt dt merged commit 17fe1a6 into cockroachdb:master Jan 31, 2022
@dt dt deleted the sst-rewrite branch January 31, 2022 16:23
nicktrav added a commit to nicktrav/cockroach that referenced this pull request Feb 9, 2022
The `SuffixReplaceableBlockCollector` interface was added to to Pebble
in cockroachdb/pebble#1377, which allows a block property collector to
indicate that it supports being updated during suffix replacement.

The existing `pebbleDataBlockMVCCTimeIntervalCollector` block property
collector does not currently support suffix replacement. However, given
a new suffix, the interval for a block can be trivially calculated.

Implement the `SuffixReplaceableBlockCollector` interface.

Release note: None
nicktrav added a commit to nicktrav/cockroach that referenced this pull request Feb 10, 2022
The `SuffixReplaceableBlockCollector` interface was added to Pebble in
cockroachdb/pebble#1377, which allows a block property collector to
indicate that it supports being updated during suffix replacement.

The existing `pebbleDataBlockMVCCTimeIntervalCollector` block property
collector does not currently support suffix replacement. However, given
a new suffix, the interval for a block can be trivially calculated.

Implement the `SuffixReplaceableBlockCollector` interface.

Release note: None
craig bot pushed a commit to cockroachdb/cockroach that referenced this pull request Feb 12, 2022
75568: util/tracing: prevent spans reuse during registry iteration r=andreimatei a=andreimatei

ActiveSpansRegistry.VisitSpans() was referencing spans without
accounting for the references by incrementing the span's reference
count. This resulted in races between that use and the spans getting
Finish()ed and re-allocated.

Fixes #75380

Release note: None

76327: pkg/storage: support suffix replacement on MVCC interval collector r=sumeerbhola a=nicktrav

The `SuffixReplaceableBlockCollector` interface was added to to Pebble
in cockroachdb/pebble#1377, which allows a block property collector to
indicate that it supports being updated during suffix replacement.

The existing `pebbleDataBlockMVCCTimeIntervalCollector` block property
collector does not currently support suffix replacement. However, given
a new suffix, the interval for a block can be trivially calculated.

Implement the `SuffixReplaceableBlockCollector` interface.

Release note: None

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Co-authored-by: Nick Travers <travers@cockroachlabs.com>
RajivTS pushed a commit to RajivTS/cockroach that referenced this pull request Mar 6, 2022
The `SuffixReplaceableBlockCollector` interface was added to Pebble in
cockroachdb/pebble#1377, which allows a block property collector to
indicate that it supports being updated during suffix replacement.

The existing `pebbleDataBlockMVCCTimeIntervalCollector` block property
collector does not currently support suffix replacement. However, given
a new suffix, the interval for a block can be trivially calculated.

Implement the `SuffixReplaceableBlockCollector` interface.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants