storage: improve range key clearing#85364
Conversation
423a9f1 to
3d6b221
Compare
jbowens
left a comment
There was a problem hiding this comment.
The iffiest bit is the unconditional Pebble range tombstones at the bottom of Raft snapshot SSTs, both because they're unconditional, and also because they will be fairly common. @jbowens Do you consider this problematic? Specifically, this bit here:
I think we'll need cockroachdb/pebble#1839 to not tank iteration perf, but it should be a pretty straightforward change. Once we have that, I don't think it's any worse than the existing state where we include a point key rangedel across the replica's bounds.
Both the point and range tombstones in the snapshot SST are somewhat problematic in that they are wide, and they're included in a large SST that is expensive to re-compact. The artificially extended width of the sstable can be a problem, because it forces ingested sstables that overlap with the tombstone but not the points into higher levels of the LSM. IIRC, there have been attempts to remove the range tombstone from the snapshot SST but they've run into gotchas. It's possible we could mitigate this at the Pebble-level too: cockroachdb/pebble#1838 (comment)
Reviewed 17 of 17 files at r1, 12 of 12 files at r2, 8 of 8 files at r3, 8 of 8 files at r4, 4 of 4 files at r5, 3 of 3 files at r6, 11 of 11 files at r7, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @pavelkalinnikov)
-- commits line 62 at r5:
nit: maybe we should say 'range key tombstones' instances where we mean specifically the range-key kind.
erikgrinaker
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @aliher1911, @jbowens, and @pavelkalinnikov)
Previously, jbowens (Jackson Owens) wrote…
nit: maybe we should say 'range key tombstones' instances where we mean specifically the range-key kind.
But don't we have both "point" and range tombstones for rangekeys too? A rangekeyunset is a "point" tombstone for range keys, no?
jbowens
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @pavelkalinnikov)
Previously, erikgrinaker (Erik Grinaker) wrote…
But don't we have both "point" and range tombstones for rangekeys too? A rangekeyunset is a "point" tombstone for range keys, no?
Not in the language we use within Pebble, but I see the similarity. Within Pebble's comments and documentation anything related to the range key keyspace is 'range key ___', and both RANGEKEYDEL and RANGEKEYUNSET are tombstones within the range key keyspace.
My main objection to "Pebble range tombstone" is that historically and within Pebble, this means a RANGEDEL. I'm somewhat regretting not naming range keys 'span keys' or something else appreciably different.
erikgrinaker
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @aliher1911, @jbowens, and @pavelkalinnikov)
-- commits line 62 at r5:
Right, but in this case, "Pebble range tombstone" does mean a RANGEDEL:
will no longer drop Pebble range tombstones across the range key span unless there are actual range keys present..
Previously, we would drop a RANGEDEL across the range key space regardless of whether there were range keys present.
I've tried to be reasonably consistent with only using "Pebble range tombstones" about RANGEDELS.
jbowens
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @pavelkalinnikov)
Previously, erikgrinaker (Erik Grinaker) wrote…
Right, but in this case, "Pebble range tombstone" does mean a RANGEDEL:
will no longer drop Pebble range tombstones across the range key span unless there are actual range keys present..
Previously, we would drop a RANGEDEL across the range key space regardless of whether there were range keys present.
I've tried to be reasonably consistent with only using "Pebble range tombstones" about RANGEDELS.
I'm still confused. "Pebble range tombstones" have no affect on range keys, so they shouldn't be conditional on whether there exists a range key present, right?
erikgrinaker
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @aliher1911, @jbowens, and @pavelkalinnikov)
Previously, jbowens (Jackson Owens) wrote…
I'm still confused. "Pebble range tombstones" have no affect on range keys, so they shouldn't be conditional on whether there exists a range key present, right?
Ah sorry, I mixed up RANGEDEL and RANGEKEYDEL. I see what you're saying, will try to reword.
This patch adds boolean parameters for point/range keys to `Writer.ClearRawRange()`. When true, it will unconditionally write Pebble range tombstones across the point/range key spans. This gives the caller better control over when to write tombstones, to avoid writing them unnecessarily. This is a behavioral change, because the previous `Engine` and `Batch` implementations would only drop Pebble range key tombstones if there were range keys present. Later commits will introduce higher-level functionality to avoid writing these tombstones unnecessarily and adapt callers. Release note: None
Previously, `Writer.ClearMVCCIteratorRange` would use a Pebble `RANGEKEYDEL` to clear all range keys within the span. This went against the intent of the method, which is to iterate across the the span and clear individual range keys. This patch changes is to do the latter. Parameters are also added to control whether to clear point and/or range keys. Release note: None
This patch removes `Writer.ClearAllRangeKeys()`, which was used to clear all range keys in a span using a single `RANGEKEYDEL` tombstone. `ClearRawRange` should be used instead. This also removes some clumsy `Engine` logic that attempted to detect and avoid writing range key tombstones if there were no range keys, which required additional metadata tracking in batches and such. The responsibility for this is instead placed on callers. It was never possible to handle this sufficiently inside the writers anyway, because of e.g. `SSTWriter` which does not have access to a reader. Release note: None
This patch adds `Writer.ClearEngineRangeKey()`, and makes `ClearMVCCRangeKey()` call through to it. It also takes the opportunity to consolidate the put logic in a similar fashion, by calling through to the engine method. Release note: None
This patch improves `ClearRangeWithHeuristic()` to take individual, optional thresholds for point and/or range keys, controlling when to use `RANGEDEL` or `RANGEKEYDEL` instead of individual tombstones. The Raft replica clearing code has been adapted to take advantage of this, and in particular, will no longer drop Pebble range key tombstones unless there are actual range keys present. Release note: None
This patch improves range key clearing in `ClearRange`, by using `ClearRangeWithHeuristic` to only write Pebble range key tombstones if there are any range keys present. Release note: None
This patch adds point/range key parameters to `Writer.ClearMVCCRange()` controlling whether to clear the key type. This can be used by callers to avoid dropping unnecessary tombstones, particularly across range key spans that are expected to be empty most of the time. Release note: None
3d6b221 to
99c3057
Compare
erikgrinaker
left a comment
There was a problem hiding this comment.
I think we'll need cockroachdb/pebble#1839 to not tank iteration perf, but it should be a pretty straightforward change.
I think we'll need cockroachdb/pebble#1839 to not tank iteration perf, but it should be a pretty straightforward change.
I'm kind of surprised that we're not already seeing this show up in roachtest benchmarks. Maybe we don't end up actually sending any snapshots that often. In any case, yeah, that seems worthwhile.
Thanks for the background here, sounds like these range deletions aren't an immediate problem then.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aliher1911, @jbowens, and @pavelkalinnikov)
Previously, erikgrinaker (Erik Grinaker) wrote…
Ah sorry, I mixed up RANGEDEL and RANGEKEYDEL. I see what you're saying, will try to reword.
Updated the commit messages and comments. I still refer to RANGEDEL and RANGEKEYDEL collectively as Pebble range tombstones in a few places where it would be awkward or overly verbose to be more specific -- I think it's clear enough from the context.
|
TFTR! bors r=jbowens |
|
Build succeeded: |
This patch set improves clearing of range keys, with the overall goal of avoiding dropping Pebble range tombstones unnecessarily, in particular across range key spans that will most often be empty. The main affected components are:
ClearRangegRPC method: similarly to point keys, it now uses an iterator to clear individual range keys for small data sets. For larger data sets, it will use Pebble range tombstones to clear point or range keys separately, but only if any such keys exist.Raft snapshots
Raft splits: on RHS conflict, iterates and clears any RHS point/range keys (individually or ranged).
Raft merges: for RHS range-local data, iterates and clears any RHS point/range keys (individually or ranged).
Raft replica removal: iterates and clears any RHS point/range keys (individually or ranged).
Raft log truncation: no change, only writes tombstones across point keys.
The iffiest bit is the unconditional Pebble range tombstones at the bottom of Raft snapshot SSTs, both because they're unconditional, and also because they will be fairly common. @jbowens Do you consider this problematic? Specifically, this bit here:
cockroach/pkg/kv/kvserver/store_snapshot.go
Lines 183 to 185 in b0e76dc
Resolves #83032.
storage: add point/range key params to
Writer.ClearRawRange()This patch adds boolean parameters for point/range keys to
Writer.ClearRawRange(). When true, it will unconditionally writePebble range tombstones across the point/range key spans. This gives the
caller better control over when to write tombstones, to avoid writing
them unnecessarily.
This is a behavioral change, because the previous
EngineandBatchimplementations would only drop Pebble range key tombstones if there
were range keys present. Later commits will introduce higher-level
functionality to avoid writing these tombstones unnecessarily and adapt
callers.
Release note: None
storage: clear individual range keys in
ClearMVCCIteratorRangePreviously,
Writer.ClearMVCCIteratorRangewould use a PebbleRANGEKEYDELto clear all range keys within the span. This went againstthe intent of the method, which is to iterate across the the span and
clear individual range keys. This patch changes is to do the latter.
Parameters are also added to control whether to clear point and/or range
keys.
Release note: None
storage: remove
Writer.ClearAllRangeKeysThis patch removes
Writer.ClearAllRangeKeys(), which was used to clearall range keys in a span using a single
RANGEKEYDELtombstone.ClearRawRangeshould be used instead.This also removes some clumsy
Enginelogic that attempted to detectand avoid writing range key tombstones if there were no range keys,
which required additional metadata tracking in batches and such. The
responsibility for this is instead placed on callers. It was never
possible to handle this sufficiently inside the writers anyway, because
of e.g.
SSTWriterwhich does not have access to a reader.Release note: None
storage: add
Writer.ClearEngineRangeKey()This patch adds
Writer.ClearEngineRangeKey(), and makesClearMVCCRangeKey()call through to it. It also takes the opportunityto consolidate the put logic in a similar fashion, by calling through to
the engine method.
Release note: None
storage: improve
ClearRangeWithHeuristicfor range keysThis patch improves
ClearRangeWithHeuristic()to take individual,optional thresholds for point and/or range keys, controlling when to use
RANGEDELorRANGEKEYDELinstead of individual tombstones. The Raftreplica clearing code has been adapted to take advantage of this, and in
particular, will no longer drop Pebble range key tombstones unless there
are actual range keys present.
Release note: None
kvserver/batcheval: improve range key clears in
ClearRangeThis patch improves range key clearing in
ClearRange, by usingClearRangeWithHeuristicto only write Pebble range key tombstones ifthere are any range keys present.
Release note: None
storage: add point/range key params to
Writer.ClearMVCCRangeThis patch adds point/range key parameters to
Writer.ClearMVCCRange()controlling whether to clear the key type. This can be used by callers
to avoid dropping unnecessary tombstones, particularly across range key
spans that are expected to be empty most of the time.
Release note: None