Skip to content

storage: improve range key clearing#85364

Merged
craig[bot] merged 7 commits intocockroachdb:masterfrom
erikgrinaker:mvcc-range-tombstones-clear-redux
Aug 1, 2022
Merged

storage: improve range key clearing#85364
craig[bot] merged 7 commits intocockroachdb:masterfrom
erikgrinaker:mvcc-range-tombstones-clear-redux

Conversation

@erikgrinaker
Copy link
Copy Markdown
Contributor

@erikgrinaker erikgrinaker commented Jul 30, 2022

This patch set improves clearing of range keys, with the overall goal of avoiding dropping Pebble range tombstones unnecessarily, in particular across range key spans that will most often be empty. The main affected components are:

  • ClearRange gRPC method: similarly to point keys, it now uses an iterator to clear individual range keys for small data sets. For larger data sets, it will use Pebble range tombstones to clear point or range keys separately, but only if any such keys exist.

  • Raft snapshots

    • Unreplicated data: no change, we only write a Pebble range tombstone across the point key span, since we don't expect range keys here.
    • Subsumed replicas' range-local state: iterates and clears any point/range keys (individually or ranged).
    • Snapshot SSTs: every SST (for each key span in the Raft range) unconditionally has a Pebble range tombstone at the bottom, both across point keys and range keys.
  • Raft splits: on RHS conflict, iterates and clears any RHS point/range keys (individually or ranged).

  • Raft merges: for RHS range-local data, iterates and clears any RHS point/range keys (individually or ranged).

  • Raft replica removal: iterates and clears any RHS point/range keys (individually or ranged).

  • Raft log truncation: no change, only writes tombstones across point keys.

The iffiest bit is the unconditional Pebble range tombstones at the bottom of Raft snapshot SSTs, both because they're unconditional, and also because they will be fairly common. @jbowens Do you consider this problematic? Specifically, this bit here:

if err := msstw.currSST.ClearRawRange(
msstw.keySpans[msstw.currSpan].Key, msstw.keySpans[msstw.currSpan].EndKey,
true /* pointKeys */, true, /* rangeKeys */

Resolves #83032.


storage: add point/range key params to Writer.ClearRawRange()

This patch adds boolean parameters for point/range keys to
Writer.ClearRawRange(). When true, it will unconditionally write
Pebble range tombstones across the point/range key spans. This gives the
caller better control over when to write tombstones, to avoid writing
them unnecessarily.

This is a behavioral change, because the previous Engine and Batch
implementations would only drop Pebble range key tombstones if there
were range keys present. Later commits will introduce higher-level
functionality to avoid writing these tombstones unnecessarily and adapt
callers.

Release note: None

storage: clear individual range keys in ClearMVCCIteratorRange

Previously, Writer.ClearMVCCIteratorRange would use a Pebble
RANGEKEYDEL to clear all range keys within the span. This went against
the intent of the method, which is to iterate across the the span and
clear individual range keys. This patch changes is to do the latter.

Parameters are also added to control whether to clear point and/or range
keys.

Release note: None

storage: remove Writer.ClearAllRangeKeys

This patch removes Writer.ClearAllRangeKeys(), which was used to clear
all range keys in a span using a single RANGEKEYDEL tombstone.
ClearRawRange should be used instead.

This also removes some clumsy Engine logic that attempted to detect
and avoid writing range key tombstones if there were no range keys,
which required additional metadata tracking in batches and such. The
responsibility for this is instead placed on callers. It was never
possible to handle this sufficiently inside the writers anyway, because
of e.g. SSTWriter which does not have access to a reader.

Release note: None

storage: add Writer.ClearEngineRangeKey()

This patch adds Writer.ClearEngineRangeKey(), and makes
ClearMVCCRangeKey() call through to it. It also takes the opportunity
to consolidate the put logic in a similar fashion, by calling through to
the engine method.

Release note: None

storage: improve ClearRangeWithHeuristic for range keys

This patch improves ClearRangeWithHeuristic() to take individual,
optional thresholds for point and/or range keys, controlling when to use
RANGEDEL or RANGEKEYDEL instead of individual tombstones. The Raft
replica clearing code has been adapted to take advantage of this, and in
particular, will no longer drop Pebble range key tombstones unless there
are actual range keys present.

Release note: None

kvserver/batcheval: improve range key clears in ClearRange

This patch improves range key clearing in ClearRange, by using
ClearRangeWithHeuristic to only write Pebble range key tombstones if
there are any range keys present.

Release note: None

storage: add point/range key params to Writer.ClearMVCCRange

This patch adds point/range key parameters to Writer.ClearMVCCRange()
controlling whether to clear the key type. This can be used by callers
to avoid dropping unnecessary tombstones, particularly across range key
spans that are expected to be empty most of the time.

Release note: None

@erikgrinaker erikgrinaker requested review from a team as code owners July 30, 2022 15:54
@erikgrinaker erikgrinaker self-assigned this Jul 30, 2022
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Contributor

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

The iffiest bit is the unconditional Pebble range tombstones at the bottom of Raft snapshot SSTs, both because they're unconditional, and also because they will be fairly common. @jbowens Do you consider this problematic? Specifically, this bit here:

I think we'll need cockroachdb/pebble#1839 to not tank iteration perf, but it should be a pretty straightforward change. Once we have that, I don't think it's any worse than the existing state where we include a point key rangedel across the replica's bounds.

Both the point and range tombstones in the snapshot SST are somewhat problematic in that they are wide, and they're included in a large SST that is expensive to re-compact. The artificially extended width of the sstable can be a problem, because it forces ingested sstables that overlap with the tombstone but not the points into higher levels of the LSM. IIRC, there have been attempts to remove the range tombstone from the snapshot SST but they've run into gotchas. It's possible we could mitigate this at the Pebble-level too: cockroachdb/pebble#1838 (comment)

Reviewed 17 of 17 files at r1, 12 of 12 files at r2, 8 of 8 files at r3, 8 of 8 files at r4, 4 of 4 files at r5, 3 of 3 files at r6, 11 of 11 files at r7, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @pavelkalinnikov)


-- commits line 62 at r5:
nit: maybe we should say 'range key tombstones' instances where we mean specifically the range-key kind.

Copy link
Copy Markdown
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aliher1911, @jbowens, and @pavelkalinnikov)


-- commits line 62 at r5:

Previously, jbowens (Jackson Owens) wrote…

nit: maybe we should say 'range key tombstones' instances where we mean specifically the range-key kind.

But don't we have both "point" and range tombstones for rangekeys too? A rangekeyunset is a "point" tombstone for range keys, no?

Copy link
Copy Markdown
Contributor

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @pavelkalinnikov)


-- commits line 62 at r5:

Previously, erikgrinaker (Erik Grinaker) wrote…

But don't we have both "point" and range tombstones for rangekeys too? A rangekeyunset is a "point" tombstone for range keys, no?

Not in the language we use within Pebble, but I see the similarity. Within Pebble's comments and documentation anything related to the range key keyspace is 'range key ___', and both RANGEKEYDEL and RANGEKEYUNSET are tombstones within the range key keyspace.

My main objection to "Pebble range tombstone" is that historically and within Pebble, this means a RANGEDEL. I'm somewhat regretting not naming range keys 'span keys' or something else appreciably different.

Copy link
Copy Markdown
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aliher1911, @jbowens, and @pavelkalinnikov)


-- commits line 62 at r5:
Right, but in this case, "Pebble range tombstone" does mean a RANGEDEL:

will no longer drop Pebble range tombstones across the range key span unless there are actual range keys present..

Previously, we would drop a RANGEDEL across the range key space regardless of whether there were range keys present.

I've tried to be reasonably consistent with only using "Pebble range tombstones" about RANGEDELS.

Copy link
Copy Markdown
Contributor

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @pavelkalinnikov)


-- commits line 62 at r5:

Previously, erikgrinaker (Erik Grinaker) wrote…

Right, but in this case, "Pebble range tombstone" does mean a RANGEDEL:

will no longer drop Pebble range tombstones across the range key span unless there are actual range keys present..

Previously, we would drop a RANGEDEL across the range key space regardless of whether there were range keys present.

I've tried to be reasonably consistent with only using "Pebble range tombstones" about RANGEDELS.

I'm still confused. "Pebble range tombstones" have no affect on range keys, so they shouldn't be conditional on whether there exists a range key present, right?

Copy link
Copy Markdown
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aliher1911, @jbowens, and @pavelkalinnikov)


-- commits line 62 at r5:

Previously, jbowens (Jackson Owens) wrote…

I'm still confused. "Pebble range tombstones" have no affect on range keys, so they shouldn't be conditional on whether there exists a range key present, right?

Ah sorry, I mixed up RANGEDEL and RANGEKEYDEL. I see what you're saying, will try to reword.

This patch adds boolean parameters for point/range keys to
`Writer.ClearRawRange()`. When true, it will unconditionally write
Pebble range tombstones across the point/range key spans. This gives the
caller better control over when to write tombstones, to avoid writing
them unnecessarily.

This is a behavioral change, because the previous `Engine` and `Batch`
implementations would only drop Pebble range key tombstones if there
were range keys present. Later commits will introduce higher-level
functionality to avoid writing these tombstones unnecessarily and adapt
callers.

Release note: None
Previously, `Writer.ClearMVCCIteratorRange` would use a Pebble
`RANGEKEYDEL` to clear all range keys within the span. This went against
the intent of the method, which is to iterate across the the span and
clear individual range keys. This patch changes is to do the latter.

Parameters are also added to control whether to clear point and/or range
keys.

Release note: None
This patch removes `Writer.ClearAllRangeKeys()`, which was used to clear
all range keys in a span using a single `RANGEKEYDEL` tombstone.
`ClearRawRange` should be used instead.

This also removes some clumsy `Engine` logic that attempted to detect
and avoid writing range key tombstones if there were no range keys,
which required additional metadata tracking in batches and such. The
responsibility for this is instead placed on callers. It was never
possible to handle this sufficiently inside the writers anyway, because
of e.g. `SSTWriter` which does not have access to a reader.

Release note: None
This patch adds `Writer.ClearEngineRangeKey()`, and makes
`ClearMVCCRangeKey()` call through to it. It also takes the opportunity
to consolidate the put logic in a similar fashion, by calling through to
the engine method.

Release note: None
This patch improves `ClearRangeWithHeuristic()` to take individual,
optional thresholds for point and/or range keys, controlling when to use
`RANGEDEL` or `RANGEKEYDEL` instead of individual tombstones. The Raft
replica clearing code has been adapted to take advantage of this, and in
particular, will no longer drop Pebble range key tombstones unless there
are actual range keys present.

Release note: None
This patch improves range key clearing in `ClearRange`, by using
`ClearRangeWithHeuristic` to only write Pebble range key tombstones if
there are any range keys present.

Release note: None
This patch adds point/range key parameters to `Writer.ClearMVCCRange()`
controlling whether to clear the key type. This can be used by callers
to avoid dropping unnecessary tombstones, particularly across range key
spans that are expected to be empty most of the time.

Release note: None
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-clear-redux branch from 3d6b221 to 99c3057 Compare August 1, 2022 18:33
Copy link
Copy Markdown
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'll need cockroachdb/pebble#1839 to not tank iteration perf, but it should be a pretty straightforward change.
I think we'll need cockroachdb/pebble#1839 to not tank iteration perf, but it should be a pretty straightforward change.

I'm kind of surprised that we're not already seeing this show up in roachtest benchmarks. Maybe we don't end up actually sending any snapshots that often. In any case, yeah, that seems worthwhile.

Thanks for the background here, sounds like these range deletions aren't an immediate problem then.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aliher1911, @jbowens, and @pavelkalinnikov)


-- commits line 62 at r5:

Previously, erikgrinaker (Erik Grinaker) wrote…

Ah sorry, I mixed up RANGEDEL and RANGEKEYDEL. I see what you're saying, will try to reword.

Updated the commit messages and comments. I still refer to RANGEDEL and RANGEKEYDEL collectively as Pebble range tombstones in a few places where it would be awkward or overly verbose to be more specific -- I think it's clear enough from the context.

@erikgrinaker
Copy link
Copy Markdown
Contributor Author

TFTR!

bors r=jbowens

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Aug 1, 2022

Build succeeded:

@craig craig bot merged commit 3be0e86 into cockroachdb:master Aug 1, 2022
@erikgrinaker erikgrinaker deleted the mvcc-range-tombstones-clear-redux branch August 5, 2022 11:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

storage: revisit Writer range clear API and heuristics

3 participants