storage: add MVCCStats for range keys#78085
Conversation
cfb0b56 to
33c04d4
Compare
b050b4a to
bb5aa54
Compare
cbba19c to
c75f6ff
Compare
bb5aa54 to
ba8eb78
Compare
aliher1911
left a comment
There was a problem hiding this comment.
Some nits and questions mostly otherwise looks sensible.
ba8eb78 to
34e6ab0
Compare
c75f6ff to
1d48763
Compare
34e6ab0 to
4798ae2
Compare
nicktrav
left a comment
There was a problem hiding this comment.
Reviewed 61 of 61 files at r8, 6 of 6 files at r9, 19 of 19 files at r10, 16 of 16 files at r11, 2 of 2 files at r12, 2 of 2 files at r13, 2 of 2 files at r14, 6 of 6 files at r15, 32 of 32 files at r16, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, @jbowens, and @sumeerbhola)
pkg/kv/kvserver/batcheval/cmd_delete_range.go, line 119 at r16 (raw file):
startKey, endKey, rangeStart, rangeEnd roachpb.Key, ) (roachpb.Key, roachpb.Key) { const prevKeyLength = 8192
Is there any significance to this const? Worth a comment?
2097ac5 to
3ad25e4
Compare
1d48763 to
f6d2aca
Compare
3ad25e4 to
6a1756c
Compare
0a1a054 to
acc5569
Compare
|
Allright, this bad boy is unblocked and ready for final review. |
jbowens
left a comment
There was a problem hiding this comment.
flushing some questions
Reviewed 13 of 61 files at r8, 1 of 16 files at r11, 1 of 27 files at r55, 10 of 44 files at r56, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, @jbowens, @nicktrav, and @sumeerbhola)
pkg/kv/kvserver/replica_raft.go line 2152 at r56 (raw file):
prefixEnd := prefix.PrefixEnd() iter := reader.NewMVCCIterator(storage.MVCCKeyIterKind, storage.IterOptions{ KeyTypes: storage.IterKeyTypePointsAndRanges,
there won't be mvcc delete ranges in the raft log, right?
pkg/kv/kvserver/batcheval/cmd_end_transaction.go line 1316 at r56 (raw file):
// Calculate the RHS adjustment, which turns out to be equivalent to the stats // contribution of the range key fragmentation. The naïve calculation would be // rhs.EncodedSize() - (keyLen(rhs.EndKey) - keyLen(lhs.EndKey))
should this say
rhs.EncodedSize() - (keyLen(rhs.StartKey) - keyLen(lhs.EndKey))
?
pkg/kv/kvserver/batcheval/cmd_truncate_log.go line 124 at r56 (raw file):
LowerBound: start, UpperBound: end, })
same question about raft log having range keys
pkg/kv/kvserver/batcheval/split_stats_helper.go line 34 at r56 (raw file):
// AbsPostSplit{Left,Right}Fn, and so we don't expect ContainsEstimates to be // set in them. The choice of which side to scan is controlled by ScanRightFirst. // - DeltaRangeKey: the stats delta that must be added to the non-computed
nit: this is referenced as DeltaRangeKeyRight down below, but is defined as DeltaRangeKey here. should they be unified?
pkg/storage/engine.go line 268 at r56 (raw file):
// This isn't currently done in order to do spanset assertions on it, but this // could be better solved by checking the iterator bounds in NewMVCCIterator // and requiring callers to set them appropriately.
🎉 This will allow us to drop an end-key comparison in ComputeStatsForRange: #41899
pkg/storage/mvcc.go line 902 at r56 (raw file):
which is not affected by MVCC range tombstones
can you explain this? A MVCC Range tombstone at t2 that covers the key k@t1 does cause k@t1 to contribute to GCByesAge from t2 on, just like a point tombstone would, right?
is this saying that a MVCC range tombstone that's shadowed at prefix k doesn't itself apply to prefix k's GCBytesAge?
do MVCC range tombstones themselves that are shadowed ever contribute to GCBytesAge? Eg, a MVCCDeleteRange([a,z)@t10) and a MVCCDeleteRange([a,z)@t3)?
erikgrinaker
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, @jbowens, @nicktrav, and @sumeerbhola)
pkg/kv/kvserver/replica_raft.go line 2152 at r56 (raw file):
Previously, jbowens (Jackson Owens) wrote…
there won't be mvcc delete ranges in the raft log, right?
No, shouldn't be. I figured we might as well enable range keys for all stats calculations, since I assumed the overall overhead would be negligible and better safe than sorry, but we can be more selective about it too.
pkg/kv/kvserver/batcheval/cmd_end_transaction.go line 1316 at r56 (raw file):
Previously, jbowens (Jackson Owens) wrote…
should this say
rhs.EncodedSize() - (keyLen(rhs.StartKey) - keyLen(lhs.EndKey))?
No -- that would just be 0, because rhs.StartKey == lhs.EndKey after the split.
What we're trying to adjust for here is really the change in the LHS' end key, from before the split to after the split. Before the split, the range key's end key way rhs.EndKey, while post-split it's lhs.EndKey. But as luck would have it, we can express rhs.EndKey in terms of rhs.StartKey via EncodedSize(), eliminating the dependence on rhs.EndKey.
pkg/kv/kvserver/batcheval/split_stats_helper.go line 34 at r56 (raw file):
Previously, jbowens (Jackson Owens) wrote…
nit: this is referenced as
DeltaRangeKeyRightdown below, but is defined asDeltaRangeKeyhere. should they be unified?
Ah, thanks. Yeah, it used to be the case that we always scanned the LHS and then used that to compute the RHS, but some time ago we made it possible to scan either side (since the RHS would often be empty). Forgot to update the names everywhere.
pkg/storage/engine.go line 268 at r56 (raw file):
Previously, jbowens (Jackson Owens) wrote…
🎉 This will allow us to drop an end-key comparison in
ComputeStatsForRange: #41899
Ah, nice -- definitely seems worthwhile.
pkg/storage/mvcc.go line 902 at r56 (raw file):
Previously, jbowens (Jackson Owens) wrote…
which is not affected by MVCC range tombstones
can you explain this? A MVCC Range tombstone at
t2that covers the keyk@t1does causek@t1to contribute toGCByesAgefromt2on, just like a point tombstone would, right?is this saying that a MVCC range tombstone that's shadowed at prefix
kdoesn't itself apply to prefixk'sGCBytesAge?do MVCC range tombstones themselves that are shadowed ever contribute to
GCBytesAge? Eg, aMVCCDeleteRange([a,z)@t10)and aMVCCDeleteRange([a,z)@t3)?
This really only comes up when there are multiple tombstones above a value. With point keys, the prefix contribution to GCBytesAge only starts at the last tombstone. So with a non-tombstone value k@1 with tombstones at k@2, k@3, the GCBytesAge contribution of the prefix starts accumulating from k@3, since GCing below that won't affect the prefix contribution to KeyBytes. However, if we instead had range tombstones [a-z)@2 and [a-z)@3 then the point key ceases to exist at @2, and starts accumulating GCBytesAge from there, because if we GCed up to @2 then we would remove all k point keys, so it would no longer contribute to KeyBytes.
Range keys contribute to GCBytesAge too, but only for the range key as a whole, and per fragment stack. This works in much the same way as point keys: the prefix/key contribution to GCBytesAge begins at the latest range key in the stack, which is where the range key ceases to exist (if it is a range tombstone, which it currently always is). The key contribution of range keys and point keys to stats are independent (they contribute to KeyBytes and RangeKeyBytes separately), and thus the key's contribution to GCBytesAge is also mostly independent, except for range tombstones determining the visibility of point keys.
acc5569 to
1293722
Compare
jbowens
left a comment
There was a problem hiding this comment.
Reviewed 1 of 27 files at r55, 27 of 44 files at r56, 1 of 1 files at r57, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, @jbowens, @nicktrav, and @sumeerbhola)
pkg/storage/mvcc.go line 902 at r56 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
This really only comes up when there are multiple tombstones above a value. With point keys, the prefix contribution to
GCBytesAgeonly starts at the last tombstone. So with a non-tombstone valuek@1with tombstones atk@2,k@3, theGCBytesAgecontribution of the prefix starts accumulating fromk@3, since GCing below that won't affect the prefix contribution toKeyBytes. However, if we instead had range tombstones[a-z)@2and[a-z)@3then the point key ceases to exist at@2, and starts accumulatingGCBytesAgefrom there, because if we GCed up to@2then we would remove allkpoint keys, so it would no longer contribute toKeyBytes.Range keys contribute to
GCBytesAgetoo, but only for the range key as a whole, and per fragment stack. This works in much the same way as point keys: the prefix/key contribution toGCBytesAgebegins at the latest range key in the stack, which is where the range key ceases to exist (if it is a range tombstone, which it currently always is). The key contribution of range keys and point keys to stats are independent (they contribute toKeyBytesandRangeKeyBytesseparately), and thus the key's contribution toGCBytesAgeis also mostly independent, except for range tombstones determining the visibility of point keys.
Ah, thanks, that makes a lot of sense.
pkg/storage/mvcc.go line 2686 at r57 (raw file):
case 1: // fragment fragmentRangeKeys(iter.RangeKeys(), startKey) case 0: // merge
To avoid the duplication of this defragmenting logic, could we create an iterator that reads through a batch containing the new range-key iterator? If the SeekLT(startKey) lands on a range key whose end boundary extends beyond startKey, the addition of the new range key merged two fragments. If the surfaced range key has an end boundary equal to startKey, the addition of the range key fragmented a key at the left bound.
Release note: None
This patch adds `Key.Prevish()`, which returns a previous key in lexicographical sort order. This is needed to expand a latch span leftwards to peek at any left-adjacent range keys. It is impossible to find the exact immediate predecessor of a key, because it can have an infinite number of `0xff` bytes at the end, so this returns the nearest previous key right-padded with `0xff` up to the given length. It is still possible for an infinite number of keys to exist between `Key` and `Key.Prevish()`, as keys have unbounded length. Release note: None
1293722 to
f86f168
Compare
erikgrinaker
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, @jbowens, @nicktrav, and @sumeerbhola)
pkg/storage/mvcc.go line 2686 at r57 (raw file):
Previously, jbowens (Jackson Owens) wrote…
To avoid the duplication of this defragmenting logic, could we create an iterator that reads through a batch containing the new range-key iterator? If the
SeekLT(startKey)lands on a range key whose end boundary extends beyondstartKey, the addition of the new range key merged two fragments. If the surfaced range key has an end boundary equal tostartKey, the addition of the range key fragmented a key at the left bound.
That might work here, but we're going to need similar logic in a bunch of other places where I don't think that's viable, e.g. during AddSSTable, ClearRange, RevertRange, as well as during splits/merges. That said, this code is rather ugly. The plan is to consolidate it with logic elsewhere and clean it up once we've implemented stats handling everywhere at the KV layer. I added a TODO explaining this. Wdyt?
This patch adds `MVCCStats` tracking for range keys. Four new fields are added to `MVCCStats`: * `RangeKeyCount`: the number of (fragmented) range keys, not counting historical versions. * `RangeKeyBytes`: the logical encoded byte size of all range keys. The latest version contributes the encoded key bounds, and all versions contribute encoded timestamps. Unlike point keys, which for historical reasons use a fixed-size timestamp contribution, this uses the actual variable-length timestamp size. * `RangeValCount`: the number of (fragmented) range key versions. * `RangeValBytes`: the encoded size of all range key values across all versions. The same value can be stored in multiple range keys due to fragmentation, which will be counted separately. Even though all range keys are currently MVCC range tombstones with no value, the `MVCCValueHeader` contribution can be non-zero due to e.g. a local timestamp. `ComputeStatsForRange()` has been extended to calculate the above quantities, and additionally account for range tombstones themselves in `GCBytesAge` along with their effect on point keys. All relevant call sites have been updated to surface range keys for the MVCC iterators passed to `ComputeStatsForRange()`. Most MVCC operations have been updated to correctly account for MVCC range tombstones, e.g. during point key writes and intent resolution. KV APIs are not yet updated, this will be addressed later. Range key stats are also adjusted during range splits and merges, which will split and merge any range keys that straddle the split key. This requires a single range key seek to the left and right of the split key during these operations. Release note: None
f86f168 to
8fe2362
Compare
jbowens
left a comment
There was a problem hiding this comment.
Reviewed 1 of 37 files at r25, 4 of 60 files at r26, 11 of 31 files at r52, 16 of 27 files at r55, 7 of 44 files at r56.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, @jbowens, @nicktrav, and @sumeerbhola)
pkg/storage/mvcc.go line 2686 at r57 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
That might work here, but we're going to need similar logic in a bunch of other places where I don't think that's viable, e.g. during
AddSSTable,ClearRange,RevertRange, as well as during splits/merges. That said, this code is rather ugly. The plan is to consolidate it with logic elsewhere and clean it up once we've implemented stats handling everywhere at the KV layer. I added a TODO explaining this. Wdyt?
Yeah, fine by me. When we get there, we might consider exposing what we need from Pebble if it'll help contain the complexity.
|
TFTR! bors r=jbowens |
|
Build failed (retrying...): |
|
Build failed (retrying...): |
78085: storage: add `MVCCStats` for range keys r=jbowens a=erikgrinaker **storage: export some MVCC key encoding functions** Release note: None **roachpb: add `Key.Prevish()` to find a previous key** This patch adds `Key.Prevish()`, which returns a previous key in lexicographical sort order. This is needed to expand a latch span leftwards to peek at any left-adjacent range keys. It is impossible to find the exact immediate predecessor of a key, because it can have an infinite number of `0xff` bytes at the end, so this returns the nearest previous key right-padded with `0xff` up to the given length. It is still possible for an infinite number of keys to exist between `Key` and `Key.Prevish()`, as keys have unbounded length. Release note: None **storage: add `MVCCStats` for range keys** This patch adds `MVCCStats` tracking for range keys. Four new fields are added to `MVCCStats`: * `RangeKeyCount`: the number of (fragmented) range keys, not counting historical versions. * `RangeKeyBytes`: the logical encoded byte size of all range keys. The latest version contributes the encoded key bounds, and all versions contribute encoded timestamps. Unlike point keys, which for historical reasons use a fixed-size timestamp contribution, this uses the actual variable-length timestamp size. * `RangeValCount`: the number of (fragmented) range key versions. * `RangeValBytes`: the encoded size of all range key values across all versions. The same value can be stored in multiple range keys due to fragmentation, which will be counted separately. Even though all range keys are currently MVCC range tombstones with no value, the `MVCCValueHeader` contribution can be non-zero due to e.g. a local timestamp. `ComputeStatsForRange()` has been extended to calculate the above quantities, and additionally account for range tombstones themselves in `GCBytesAge` along with their effect on point keys. All relevant call sites have been updated to surface range keys for the MVCC iterators passed to `ComputeStatsForRange()`. Most MVCC operations have been updated to correctly account for MVCC range tombstones, e.g. during point key writes and intent resolution. KV APIs are not yet updated, this will be addressed later. Range key stats are also adjusted during range splits and merges, which will split and merge any range keys that straddle the split key. This requires a single range key seek to the left and right of the split key during these operations. Touches #70412. Release note: None 82384: sql: reuse the slice of RequestUnion objects between fetches r=yuzefovich a=yuzefovich This commit teaches `txnKVFetcher` and `txnKVStreamer` to reuse the same slice of `RequestUnion` objects between different fetches. It is now extremely easy to do given the recent refactor. We do perform memory accounting for this slice (against a memory account bound to an unlimited memory monitor). Additionally, a similar optimization is applied to how resume requests are populated by the Streamer. Addresses: #82160. Release note: None Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com> Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
|
Build failed (retrying...): |
|
Build failed: |
|
bors retry |
|
Build succeeded: |
storage: export some MVCC key encoding functions
Release note: None
roachpb: add
Key.Prevish()to find a previous keyThis patch adds
Key.Prevish(), which returns a previous key inlexicographical sort order. This is needed to expand a latch span
leftwards to peek at any left-adjacent range keys.
It is impossible to find the exact immediate predecessor of a key,
because it can have an infinite number of
0xffbytes at the end, sothis returns the nearest previous key right-padded with
0xffup to thegiven length. It is still possible for an infinite number of keys to
exist between
KeyandKey.Prevish(), as keys have unbounded length.Release note: None
storage: add
MVCCStatsfor range keysThis patch adds
MVCCStatstracking for range keys. Four new fields areadded to
MVCCStats:RangeKeyCount: the number of (fragmented) range keys, not countinghistorical versions.
RangeKeyBytes: the logical encoded byte size of all range keys.The latest version contributes the encoded key bounds, and all
versions contribute encoded timestamps. Unlike point keys, which for
historical reasons use a fixed-size timestamp contribution, this uses
the actual variable-length timestamp size.
RangeValCount: the number of (fragmented) range key versions.RangeValBytes: the encoded size of all range key values acrossall versions. The same value can be stored in multiple range keys
due to fragmentation, which will be counted separately. Even though
all range keys are currently MVCC range tombstones with no value, the
MVCCValueHeadercontribution can be non-zero due to e.g. a localtimestamp.
ComputeStatsForRange()has been extended to calculate the abovequantities, and additionally account for range tombstones themselves in
GCBytesAgealong with their effect on point keys. All relevant callsites have been updated to surface range keys for the MVCC iterators
passed to
ComputeStatsForRange().Most MVCC operations have been updated to correctly account for MVCC
range tombstones, e.g. during point key writes and intent resolution. KV
APIs are not yet updated, this will be addressed later.
Range key stats are also adjusted during range splits and merges, which
will split and merge any range keys that straddle the split key. This
requires a single range key seek to the left and right of the split key
during these operations.
Touches #70412.
Release note: None