sstable: add parallelized RewriteKeySuffixes function #1377

dt merged 1 commit into cockroachdb:master
Conversation
jbowens
left a comment
Nice!
We do anticipate computing block property collectors over data beyond just the key's timestamp. We've been considering explorations around pushing down SQL filters and projections into Pebble, and using block property collectors to aggregate statistics over keys within a block.
Concurrency while reading/decompressing and writing/compressing is desirable more generally (eg, #599), and I'm hopeful that we can combine these use cases into one facility that offers parallelism. I suspect that we can get the performance we need while doing a sequential iteration that computes block property collectors sequentially but defers everything else to parallel workers.
Reviewable status: 0 of 7 files reviewed, all discussions resolved (waiting on @erikgrinaker and @sumeerbhola)
279a02b to 2b3cd36
@jbowens I'm not too worried about future table and block prop collectors that require seeing every kv. I think for table prop collectors we could extend their API to have a
sumeerbhola
left a comment
Data blocks are rewritten in parallel.
Do you have a CPU profile of this benchmark? I am curious what the big consumers of CPU are in the part that has been parallelized.
Reviewable status: 0 of 7 files reviewed, 1 unresolved discussion (waiting on @dt and @erikgrinaker)
sstable/suffix_rewriter.go, line 92 at r5 (raw file):
}
for i := worker; i < len(input); i += totalWorkers {
there is an assumption here that the work per block is not heavily skewed, which is why this static distribution that gives each worker an equal number of blocks is acceptable. Worth noting in a code comment.
I can post an actual profile later, but when I profiled it earlier it looked like the biggest consumers were compression and decompression, with this implementation spending about 21% in compression vs 12% for the loop reader/writer, and 9% vs 8% in decompression. blockWriter.store was pretty large too IIRC. I looked briefly at allocs and tried buffering the Clone allocs, which cut overall allocs by >30%, but that yielded no significant top-line perf savings.
@jbowens Another thing that occurred to me last night: ideally we'd copy over most higher-level logical props like that blindly, instead of recomputing them, since the whole point is to minimize cost. Maybe very significant here is the (enforced) stipulation that suffix replacement only modifies the suffix after

Now I imagine some prop collectors might be interested in the suffix, like, say, mvcc stats, but there too, I think one of the other stipulations of suffix replacement is extremely relevant: all the keys in the file had the same pre-replacement suffix, and are being changed to the same post-replacement suffix, so replacement does not change the relative suffix of any two keys in the file. That makes me optimistic that you still don't need any key-by-key re-computation, but rather some constant update to the prop.

Anyway, this is all speculation until we actually implement any props other than the timestamps, but I'm bullish that given the very narrow definition of suffix replacement (all sets, all with the same suffix, all updated to the same new suffix) we could make most possible prop collectors work without key-by-key recomputation, or if we had to, figure out how to aggregate partial parallel computation.
So I tried adding a

Looking at a profile for a run after, it's now just over 20% overall in the

That looks like enough of a cost to motivate keeping parallel "computation" of block props for now, perhaps with some formalization in the API, and then weighing other options if/when another form of prop requires something different?
@dt Yeah, good point. It does seem viable in most use cases to copy over properties that are not dependent on the suffix. Only block properties that depend on the combination of key-value pair and the timestamp would run into difficulties. The block property collector in #1378 is an example of a collector that needs to be computed serially, although practically speaking that particular collector wouldn't ever be used for AddSSTable-rewritten sstables that do not contain multiple MVCC versions of the same key.

Given that we're also implementing #599 right now, I think we should also explore leveraging that work for this use case to see how far it gets us. If we can avoid it, it'd be nice to avoid maintaining two separate sstable concurrency facilities. It might be the case that the concurrency facilities from #599 plus some of the other optimizations you added here (eg, copying the filter block verbatim) can get us where we need to be.
I doubt just #599 will be sufficient on its own: compression is certainly a big driver of the total time, but only about a third overall; we're still spending about 20% (now that I removed a bunch of cache.Cache mutex contention) in blockWriter.store for example, mostly in counting out bytes for sharedprefixlen, and getting that parallelized 8x is still helping us out here. As a speed-of-light test I added a compression=none case to the benchmark, just to see what would happen if somehow #599 were able to magically completely eliminate compression:
452f6f6 to 4fb427d
84889c4 to 71ffaa0
I played with the prop collectors ideas discussed above a little, and I think I found something I'm not wholly opposed to. Code is over on https://github.com/dt/pebble/compare/sst-rewrite...sst-rewrite-prop-collectors?expand=1

TL;DR: an optional extension to the collector interface that a collector can choose to implement if it wants to be usable during replacement, which is one new method that is called for each block with the old prop value for that block and the change in suffix.

Technically none of cockroach's current props actually need the old value, since they can derive their state just from the suffix, so I suppose it is a little bit of wasted extra work to read, alloc and plumb it to them, but I think the increased generality of the API justifies it, and I'm pretty sure that API would support a collector like the one in #1378 (since just updating the suffix can't change whether or not the prefix change is different, so it could elect to just copy its prior value).
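The optional-extension shape described above can be sketched roughly as follows. The interface and method names here are illustrative assumptions, not the actual code on the linked branch or Pebble's real collector API:

```go
package main

import "fmt"

// BlockPropertyCollector stands in for Pebble's real collector interface;
// only the parts needed for this sketch are shown.
type BlockPropertyCollector interface {
	Name() string
}

// suffixReplaceable sketches the optional extension discussed above:
// collectors that implement it can be updated during suffix replacement with
// one call per block, receiving the block's old property value plus the old
// and new suffixes. The method name and signature are hypothetical.
type suffixReplaceable interface {
	UpdateForSuffixReplacement(oldProp, oldSuffix, newSuffix []byte) (newProp []byte, err error)
}

// timestampCollector derives its per-block property purely from the key
// suffix, so it can ignore oldProp entirely, as noted above for CockroachDB's
// current props.
type timestampCollector struct{}

func (timestampCollector) Name() string { return "timestamps" }

func (timestampCollector) UpdateForSuffixReplacement(oldProp, oldSuffix, newSuffix []byte) ([]byte, error) {
	// The new property is derived from the new suffix alone.
	return newSuffix, nil
}

func main() {
	var c BlockPropertyCollector = timestampCollector{}
	// The rewriter would check per collector whether the extension is present:
	if sr, ok := c.(suffixReplaceable); ok {
		newProp, _ := sr.UpdateForSuffixReplacement([]byte("old"), []byte("@5"), []byte("@9"))
		fmt.Printf("%s\n", newProp) // @9
	}
}
```

A collector that does need its prior value (like the #1378 one electing to copy it) would simply return `oldProp` unchanged, which is why plumbing it through buys generality.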
@jbowens Do you think it'd be easier to review and land this as-is and then follow up with block property handling, or pull in the block property handling now and land it all at once?
sumeerbhola
left a comment
drive-by comments -- feel free to ignore
Reviewable status: 0 of 14 files reviewed, 6 unresolved discussions (waiting on @dt, @erikgrinaker, and @jbowens)
sstable/suffix_rewriter.go, line 20 at r19 (raw file):
// required that the SST consist of only SETs, and that every key have `from` as
// its suffix, and furthermore the length of from must match the suffix length
// designated by the Split() function if one is set in the WriterOptions.
is the exception `if one is set in the WriterOptions` no longer relevant since we are enforcing that the suffix is the one returned by `Split`?
sstable/suffix_rewriter.go, line 37 at r19 (raw file):
//
// 1) to be able to copy unmodified filters, it is required that the writer's
// comparer have a Split function and that the replaced suffix of each key not
this would not need to be repeated here, if this is a general requirement.
sstable/suffix_rewriter.go, line 43 at r19 (raw file):
// 2) any block and table property collectors in-use must be able to derive a
// correct result -- for a block wherein *all keys had and will have the same
// suffix* -- from a single updated key per block.
how important is this optimization in terms of performance numbers? I ask since unlike the other restrictions specified here, this one (subjectively) feels like a hack.
dt
left a comment
That speaks to the ongoing maintenance burden of having an additional code path intertwined with low-level sstable writing logic.

Eh, I mean, sure, but I think it's cheaper to update it all at once than it is to go back during a rebase and figure out where `w.restartInterval` went. As far as figuring out the common shared APIs rather than intertwining, I imagine the answer is to merge and then start to see where to draw the lines, e.g. would rewriter and writer be happy to share a common indexWriter helper struct, instead of rewriter having a Writer which it uses just for flushing indexes?
Reviewable status: 0 of 14 files reviewed, 6 unresolved discussions (waiting on @dt, @erikgrinaker, @jbowens, and @sumeerbhola)
sstable/suffix_rewriter.go, line 20 at r19 (raw file):
Previously, sumeerbhola wrote…
is the exception `if one is set in the WriterOptions` no longer relevant since we are enforcing that suffix is the one returned by `Split`?
Yeah, correct, we're requiring WriterOptions.Comparer.Split be set to call this at all. Will update.
sstable/suffix_rewriter.go, line 37 at r19 (raw file):
Previously, sumeerbhola wrote…
this would not need to be repeated here, if this is a general requirement.
Yeah, this is out of date. Will fix.
sstable/suffix_rewriter.go, line 43 at r19 (raw file):
Previously, sumeerbhola wrote…
how important is this optimization in terms of performance numbers? I ask since unlike the other restrictions specified here, this one (subjectively) feels like a hack.
I opined about this hack (and agree completely that that is what it is) above, and linked to a change that I have queued up to follow this that I think improves on it significantly with an extension to the prop collector API: dt@4207068
dffac95 to
9e21d12
Compare
dt
left a comment
Reviewable status: 0 of 14 files reviewed, 6 unresolved discussions (waiting on @erikgrinaker, @jbowens, and @sumeerbhola)
sstable/suffix_rewriter.go, line 92 at r5 (raw file):
Previously, sumeerbhola wrote…
there is an assumption here that the work per block is not heavily skewed, which is why this static distribution that gives each worker an equal number of blocks is acceptable. Worth noting in a code comment.
Done.
jbowens
left a comment
Reviewed 2 of 4 files at r18, 1 of 1 files at r21.
Reviewable status: 3 of 14 files reviewed, 13 unresolved discussions (waiting on @dt, @erikgrinaker, @jbowens, and @sumeerbhola)
sstable/suffix_rewriter.go, line 20 at r21 (raw file):
-- the an sstable to
outcontaining the sstable in
looks like some of this comment got mangled
sstable/suffix_rewriter.go, line 26 at r21 (raw file):
// simply passed each key from the Reader, and its result is returned. In this
// implementation, all filters and properties are re-derived, just as if the new
// SST had been constructed from arbitrary external keys.
Could we just use the one, concurrent implementation with concurrency=1 if concurrency is set to zero? It seems preferable to just have the one implementation.
sstable/suffix_rewriter.go, line 141 at r21 (raw file):
}
// Copy over the filter block if it exists (if it does, copyDataBlocks will
nit: s/copyDataBlocks/rewriteDataBlocksToWriter/ ?
sstable/suffix_rewriter.go, line 181 at r21 (raw file):
errCh chan error, g *sync.WaitGroup,
) {
nit: this is a pretty thin function with many parameters. inline in an anonymous function in the call site?
sstable/suffix_rewriter.go, line 201 at r21 (raw file):
restartInterval: restartInterval,
}
var compressionBuf []byte
can this use the new dataBlockBuf?
sstable/suffix_rewriter.go, line 249 at r21 (raw file):
	return err
}
prefixLen := len(key.UserKey) - len(from)

nit: can use `si` instead of computing from `len(key.UserKey) - len(from)`
sstable/suffix_rewriter.go, line 305 at r21 (raw file):
default:
	return errors.Newf("unsupported checksum type: %d", checksumType)
}

this can be a followup, but this can now use `checksummer.checksum`.
sstable/suffix_rewriter.go, line 349 at r21 (raw file):
for i := range blocks {
	// Write the rewritten block to the file.
	n, err := w.writer.Write(blocks[i].data)
This can be a followup, if at all, but this can be done in parallel to individual block rewriting with additional synchronization, right? Do you know how much time is spent in the above block rewriting versus this loop, copying into the memfile and constructing the index?
dt
left a comment
Reviewable status: 3 of 14 files reviewed, 11 unresolved discussions (waiting on @dt, @erikgrinaker, @jbowens, and @sumeerbhola)
sstable/suffix_rewriter.go, line 26 at r21 (raw file):
Previously, jbowens (Jackson Owens) wrote…
Could we just use the one, concurrent implementation with `concurrency=1` if `concurrency` is set to zero? It seems preferable to just have the one implementation.
the 0 vs 1 implementation was mostly for ease of testing. I could just make two separate functions though and then let the caller call the right one. In that case, I think concurrency < 1 is an error.
sstable/suffix_rewriter.go, line 141 at r21 (raw file):
Previously, jbowens (Jackson Owens) wrote…
nit: s/copyDataBlocks/rewriteDataBlocksToWriter/ ?
Ah, yep, thanks.
sstable/suffix_rewriter.go, line 181 at r21 (raw file):
Previously, jbowens (Jackson Owens) wrote…
nit: this is a pretty thin function with many parameters. inline in an anonymous function in the call site?
Done.
sstable/suffix_rewriter.go, line 201 at r21 (raw file):
Previously, jbowens (Jackson Owens) wrote…
can this use the new `dataBlockBuf`?
Switched to blockBuf.
sstable/suffix_rewriter.go, line 249 at r21 (raw file):
Previously, jbowens (Jackson Owens) wrote…
nit: can use `si` instead of computing from `len(key.UserKey) - len(from)`
Done.
sstable/suffix_rewriter.go, line 305 at r21 (raw file):
Previously, jbowens (Jackson Owens) wrote…
this can be a followup, but this can now use `checksummer.checksum`.
compressAndChecksum, right? this removes a bunch of the repeated code here, so that seems like a win. Unfortunately seems to force moving blockBuf to a pointer so I pick up another alloc but I'll switch it for now and then golf the perf a bit later if it is needed.
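For reference, the checksum half of the compress-and-checksum step under discussion can be sketched as below. The layout (block payload followed by a one-byte block type, with a CRC-32C over both) follows the LevelDB-style sstable convention as I understand it; treat the helper as an illustrative sketch rather than the actual `compressAndChecksum` code:

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// crcTable uses the Castagnoli polynomial, the usual choice for sstable
// block checksums.
var crcTable = crc32.MakeTable(crc32.Castagnoli)

// checksumBlock computes a checksum over the (possibly compressed) block
// payload followed by its one-byte block type, roughly the quantity the
// checksummer under discussion produces.
func checksumBlock(payload []byte, blockType byte) uint32 {
	c := crc32.Update(0, crcTable, payload)
	return crc32.Update(c, crcTable, []byte{blockType})
}

func main() {
	a := checksumBlock([]byte("block contents"), 0)
	b := checksumBlock([]byte("block contents"), 1)
	// Including the block type in the checksum means a corrupted type byte
	// is caught on read, not just a corrupted payload.
	fmt.Println(a != b)
}
```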
sstable/suffix_rewriter.go, line 349 at r21 (raw file):
Previously, jbowens (Jackson Owens) wrote…
This can be a followup, if at all, but this can be done in parallel to individual block rewriting with additional synchronization, right? Do you know how much time is spent in the above block rewriting versus this loop, copying into the memfile and constructing the index?
yeah, we could start another flusher worker and then have some additional synchronization, like atomics for a low watermark of each worker to determine what's flushable or something; I just didn't want to do it now.

Profile-wise, I don't think these memcopies really showed up last I checked -- IIRC it was mostly moving the bytes around during rewriting.
dt
left a comment
Reviewable status: 1 of 14 files reviewed, 11 unresolved discussions (waiting on @erikgrinaker, @jbowens, and @sumeerbhola)
sstable/suffix_rewriter.go, line 20 at r21 (raw file):
Previously, jbowens (Jackson Owens) wrote…
-- the an sstable to outcontaining the sstable in
looks like some of this comment got mangled
Reworked it entirely, to remove the whole this-impl-that-impl thing.
This adds a function to the sstable package that can replace some suffix with another suffix in every key in an sstable, enforcing that the sstable consists of only Sets.

If the replacement is passed a filter policy, rather than build new filters, the filter blocks from the original sstable are copied over, but only after checking that it used the same filter policy and the same splitter, and that that splitter splits off the same length as is being replaced, i.e. that the pre-replacement filter remains valid post-replacement.

Data blocks are rewritten in parallel. This informs a few choices in how it does so: rewritten keys are not passed to `addPoint`, but rather are directly handed to a worker-local blockWriter. Finished blocks are not immediately flushed to the output or added to the index, but rather are added to a buffer, which is then iterated and flushed at the end.

A significant implication of not passing each key to `addPoint` is that individual KVs are not passed to the table and block property collectors during rewriting. Instead, when flushing the finished blocks, each table collector is passed just the first key of the table and each block prop collector is passed the first key of each block. For the collectors that CockroachDB uses, to which only the timestamp key suffix is relevant, this approach may be sufficient, as during replacement all keys in all blocks have the same suffix. However, extending the collector API could allow this to be more general, e.g. either by making a separate API for this approach, like AddOneKeyForBlock, or by making the collectors able to collect and aggregate partial results, so each rewriter could have a local partial collector and then merge their results. However, as the simple approach of just passing one key per block to the existing API appears sufficient for CockroachDB, this is left for later exploration.
Benchmarking this implementation compared to simply opening a writer and feeding each key from a reader to it shows modest gains even without the addition of concurrency, just from skipping rebuilding filters and props from every key, with a 12M SST rewriting at about 33% faster. With the addition of concurrency=8, this jumps to about a 7.5x speedup over the naive read/write iteration, or 5x over concurrency=1.

```
name                                                                     time/op
RewriteSST/NoCompression/keys=100/ReaderWriterLoop-10                    29.6µs ± 4%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=1-10    45.9µs ±13%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=2-10    54.9µs ± 8%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=4-10    54.9µs ± 9%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=8-10    56.8µs ± 8%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=16-10   60.7µs ± 7%
RewriteSST/NoCompression/keys=10000/ReaderWriterLoop-10                  1.29ms ± 1%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=1-10  877µs ± 2%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=2-10  516µs ± 0%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=4-10  369µs ± 0%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=8-10  364µs ± 2%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=16-10 417µs ± 2%
RewriteSST/NoCompression/keys=1000000/ReaderWriterLoop-10                125ms ± 2%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=1-10  82.6ms ± 1%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=2-10  44.4ms ± 1%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=4-10  24.6ms ± 1%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=8-10  15.0ms ± 1%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=16-10 16.0ms ± 1%
RewriteSST/Snappy/keys=100/ReaderWriterLoop-10                           32.9µs ± 4%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=1-10           49.2µs ± 4%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=2-10           60.6µs ± 9%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=4-10           56.9µs ± 6%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=8-10           57.8µs ± 9%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=16-10          61.7µs ± 6%
RewriteSST/Snappy/keys=10000/ReaderWriterLoop-10                         1.39ms ± 2%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=1-10         958µs ± 0%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=2-10         547µs ± 2%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=4-10         382µs ± 1%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=8-10         368µs ± 1%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=16-10        424µs ± 2%
RewriteSST/Snappy/keys=1000000/ReaderWriterLoop-10                       134ms ± 2%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=1-10       90.2ms ± 1%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=2-10       48.5ms ± 1%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=4-10       27.1ms ± 2%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=8-10       16.6ms ± 2%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=16-10      17.3ms ± 1%

name                                                                     speed
RewriteSST/NoCompression/keys=100/ReaderWriterLoop-10                    195MB/s ± 4%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=1-10    126MB/s ±12%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=2-10    105MB/s ± 8%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=4-10    105MB/s ± 8%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=8-10    102MB/s ± 7%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=16-10   95.2MB/s ± 7%
RewriteSST/NoCompression/keys=10000/ReaderWriterLoop-10                  371MB/s ± 1%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=1-10  545MB/s ± 2%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=2-10  927MB/s ± 0%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=4-10  1.29GB/s ± 1%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=8-10  1.31GB/s ± 2%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=16-10 1.15GB/s ± 2%
RewriteSST/NoCompression/keys=1000000/ReaderWriterLoop-10                381MB/s ± 2%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=1-10  578MB/s ± 1%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=2-10  1.08GB/s ± 1%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=4-10  1.94GB/s ± 1%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=8-10  3.18GB/s ± 1%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=16-10 2.99GB/s ± 1%
RewriteSST/Snappy/keys=100/ReaderWriterLoop-10                           67.9MB/s ± 4%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=1-10           45.4MB/s ± 4%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=2-10           36.9MB/s ± 8%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=4-10           38.9MB/s ± 9%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=8-10           38.7MB/s ± 9%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=16-10          36.2MB/s ± 6%
RewriteSST/Snappy/keys=10000/ReaderWriterLoop-10                         87.5MB/s ± 2%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=1-10         127MB/s ± 0%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=2-10         223MB/s ± 2%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=4-10         319MB/s ± 1%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=8-10         331MB/s ± 1%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=16-10        288MB/s ± 2%
RewriteSST/Snappy/keys=1000000/ReaderWriterLoop-10                       90.3MB/s ± 2%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=1-10       134MB/s ± 1%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=2-10       249MB/s ± 1%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=4-10       446MB/s ± 2%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=8-10       726MB/s ± 2%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=16-10      697MB/s ± 1%

name                                                                     alloc/op
RewriteSST/NoCompression/keys=100/ReaderWriterLoop-10                    302kB ± 0%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=1-10    1.03MB ± 0%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=2-10    1.38MB ± 0%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=4-10    1.38MB ± 0%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=8-10    1.38MB ± 0%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=16-10   1.38MB ± 0%
RewriteSST/NoCompression/keys=10000/ReaderWriterLoop-10                  551kB ± 0%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=1-10  1.13MB ± 0%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=2-10  1.87MB ± 0%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=4-10  3.36MB ± 0%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=8-10  6.32MB ± 0%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=16-10 12.2MB ± 0%
RewriteSST/NoCompression/keys=1000000/ReaderWriterLoop-10                25.5MB ± 0%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=1-10  14.0MB ± 0%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=2-10  14.4MB ± 0%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=4-10  15.0MB ± 0%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=8-10  16.1MB ± 0%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=16-10 22.0MB ± 0%
RewriteSST/Snappy/keys=100/ReaderWriterLoop-10                           302kB ± 0%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=1-10           1.03MB ± 0%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=2-10           1.38MB ± 0%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=4-10           1.38MB ± 0%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=8-10           1.38MB ± 0%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=16-10          1.38MB ± 0%
RewriteSST/Snappy/keys=10000/ReaderWriterLoop-10                         551kB ± 0%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=1-10         1.13MB ± 0%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=2-10         1.87MB ± 0%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=4-10         3.36MB ± 0%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=8-10         6.32MB ± 0%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=16-10        12.2MB ± 0%
RewriteSST/Snappy/keys=1000000/ReaderWriterLoop-10                       25.5MB ± 0%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=1-10       14.0MB ± 0%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=2-10       14.4MB ± 0%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=4-10       15.0MB ± 0%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=8-10       16.1MB ± 0%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=16-10      22.0MB ± 0%

name                                                                     allocs/op
RewriteSST/NoCompression/keys=100/ReaderWriterLoop-10                    92.0 ± 0%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=1-10    93.0 ± 0%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=2-10    103 ± 0%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=4-10    103 ± 0%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=8-10    103 ± 0%
RewriteSST/NoCompression/keys=100/RewriteKeySuffixes,concurrency=16-10   103 ± 0%
RewriteSST/NoCompression/keys=10000/ReaderWriterLoop-10                  122 ± 0%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=1-10  226 ± 0%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=2-10  235 ± 0%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=4-10  255 ± 0%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=8-10  294 ± 0%
RewriteSST/NoCompression/keys=10000/RewriteKeySuffixes,concurrency=16-10 368 ± 0%
RewriteSST/NoCompression/keys=1000000/ReaderWriterLoop-10                168 ± 0%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=1-10  11.7k ± 0%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=2-10  11.7k ± 0%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=4-10  11.7k ± 0%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=8-10  11.7k ± 0%
RewriteSST/NoCompression/keys=1000000/RewriteKeySuffixes,concurrency=16-10 11.8k ± 0%
RewriteSST/Snappy/keys=100/ReaderWriterLoop-10                           92.0 ± 0%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=1-10           93.0 ± 0%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=2-10           103 ± 0%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=4-10           103 ± 0%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=8-10           103 ± 0%
RewriteSST/Snappy/keys=100/RewriteKeySuffixes,concurrency=16-10          103 ± 0%
RewriteSST/Snappy/keys=10000/ReaderWriterLoop-10                         122 ± 0%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=1-10         226 ± 0%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=2-10         235 ± 0%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=4-10         256 ± 0%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=8-10         294 ± 0%
RewriteSST/Snappy/keys=10000/RewriteKeySuffixes,concurrency=16-10        367 ± 0%
RewriteSST/Snappy/keys=1000000/ReaderWriterLoop-10                       168 ± 0%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=1-10       11.7k ± 0%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=2-10       11.7k ± 0%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=4-10       11.7k ± 0%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=8-10       11.7k ± 0%
RewriteSST/Snappy/keys=1000000/RewriteKeySuffixes,concurrency=16-10      11.8k ± 0%
```
@sumeerbhola are you okay with merging this now with the hacky handling of prop collectors and then immediately opening a followup PR specifically for handling them better, such as what I linked above? or would you rather pull that change in here first and merge as one when we hash it out?
Can we drop the
I think having a

Possibly, unless there are cases where we want to ingest pre-built SSTs that we don't want to bother with rewriting before submitting? If we want to make this a hard requirement I'm going to have to shuffle some stuff around, but nbd.
I think we can keep things as they are for now and only use the accelerated rewriter if SSTTimestamp is set, since in practice it always will be, and then revisit if/when anybody starts calling AddSSTable with WriteAtRequestTimestamp but not SSTTimestamp
Yeah, let's see when we integrate this into KV. Easier to change the API around now before we have any callers, so if we think it's always going to be set we may as well design around that.
I'm happy to revisit the
Is there anything left to resolve on this one before moving on to the follow-ups for property collectors?

hacky handling is fine for this merge.
Thanks for the reviews!
The `SuffixReplaceableBlockCollector` interface was added to Pebble in cockroachdb/pebble#1377, which allows a block property collector to indicate that it supports being updated during suffix replacement. The existing `pebbleDataBlockMVCCTimeIntervalCollector` block property collector does not currently support suffix replacement. However, given a new suffix, the interval for a block can be trivially calculated. Implement the `SuffixReplaceableBlockCollector` interface. Release note: None
76327: pkg/storage: support suffix replacement on MVCC interval collector r=sumeerbhola a=nicktrav
This adds a function to the sstable package that can replace some suffix
with another suffix in every key in an sstable, enforcing that the
sstable consists of only Sets.
If the replacement is passed a filter policy, rather than build new
filters, the filter blocks from the original sstable are copied over,
but only after checking that it used the same filter policy and the same
splitter, and that that splitter splits off the same length as is being
replaced, i.e. that the pre-replacement filter remains valid
post-replacement.
Data blocks are rewritten in parallel. This informs a few choices in how
it does so: rewritten keys are not passed to `addPoint`, but rather are
directly handed to a worker-local blockWriter. Finished blocks are not
immediately flushed to the output or added to the index, but rather are
added to a buffer, which is then iterated and flushed at the end.
A significant implication of not passing each key to `addPoint` is that
individual KVs are not passed to the table and block property collectors
during rewriting. Instead, when flushing the finished blocks, each table
collector is passed just the first key of the table and each block prop
collector is passed the first key of each block. For the collectors that
CockroachDB uses, to which only the timestamp key suffix is relevant,
this approach may be sufficient, as during replacement all keys in all
blocks have the same suffix. However, extending the collector API could
allow this to be more general, e.g. either by making a separate API for
this approach, like AddOneKeyForBlock, or by making the collectors able
to collect and aggregate partial results, so each rewriter could have a
local partial collector and then merge their results. However, as the
simple approach of just passing one key per block to the existing API
appears sufficient for CockroachDB, this is left for later exploration.
Benchmarking this implementation compared to simply opening a writer and
feeding each key from a reader to it shows modest gains even without the
addition of concurrency, just from skipping rebuilding filters and props
from every key, with a 12M SST rewriting at about 33% faster. With the
addition of concurrency=8, this jumps to about a 7.5x speedup over the
naive read/write iteration, or 5x over concurrency=1.