storage: remove preemptive snapshots #44615
Conversation
@tbg here's a prototype of removing preemptive snapshots. It still needs testing. One thing I tried to disambiguate was the word "initialized" as it applies to the […]. I'd love to hear your thoughts on further improvements. Thanks!
pkg/storage/replica_command.go
Outdated
// TODO(ajwerner): Should a recipient be able to decline a LEARNER snapshot?
// It seems like yes. The recipient may know that it does not want to get
// added to a range. Confer with danhhz/tbg/darin about the implications of
// making this true for SnapshotRequest_LEARNER.
This was just discussed on #44247 (comment)
TLDR: when you refuse the learner, you've already added it to the desc, so we need to be careful. We don't want to roll back replication changes en masse which is what would happen here.
	return false, errors.Errorf("found uninitialized RangeDescriptor: %+v", desc)
}
replicaDesc, found := desc.GetReplicaDescriptor(s.StoreID())
if !found {
We also get replicas like that when a replica catches up across a change replicas trigger that removes it. When you made these replicas GC themselves proactively when they hit the trigger, you did so atomically with the application of the CRT raft command, right? In other words, in this branch, are we really 100% looking at a preemptive snap? I suppose at the very least we might be looking at a replica that did catch up across the trigger before we had code that insta-gc'ed, but it seems fine to treat that as a preemptive snapshot and nuke it (which would happen to it eventually anyway, since it's not in the desc any more).
pkg/storage/store_create_replica.go
Outdated
// If this is intended for replicaID 0 then it's either a preemptive
// snapshot or a split/merge lock in which case we'll let it go through.
if replicaID == 0 {
	// TODO(ajwerner): fix this case.
I have this all paged out again, but is the split/merge lock the last remaining use of replicaID zero? I'd be so happy to not have that be a thing. (not in this PR)
Even better! I removed all uses of ReplicaID 0 in this PR already, just forgot to come back to this conditional after I did that. This can be a fatal now.
Only a very partial review, I will have to give this a look with fresh eyes next week. Excited that this is happening!
Force-pushed 1d0bd8a to 444b9a7
This is RFAL. I'll put up a second PR to update the Grafana dashboards.
Force-pushed 55ef87f to df22915
tbg
left a comment
Very, very nice. I'm so happy you chewed through and removed all of that detritus. I feel a lot better about all of the code you touched.
Reviewed 14 of 25 files at r1, 14 of 14 files at r2.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/storage/replica_init.go, line 278 at r2 (raw file):
// batch. We could avoid setting the local descriptor in this case.
// 2) When the DisableEagerReplicaRemoval testing knob is enabled. We
// could
missing some words
I agree that it would be nice to be more coherent here. Not necessarily for this PR, looks like a decent amount of individual cleanup is needed.
pkg/storage/replica_raft.go, line 1538 at r2 (raw file):
	ctx context.Context, split *roachpb.SplitTrigger,
) (func(), error) {
	// We pass a 0 replicaID because we want to lock the RHS even if we know
Comment stale.
pkg/storage/split_queue_test.go, line 77 at r2 (raw file):
// ensures that the store won't be mucking with our replica concurrently
// during testing (e.g. via the system config gossip update).
copy := *tc.repl.Desc()
Shadowing a built-in? I usually go with cpy in that case.
pkg/storage/store.go, line 1342 at r2 (raw file):
// range which has processed a raft command to remove itself (which is
// possible prior to 19.2 or if the DisableEagerReplicaRemoval is
// enabled) and has not yet been removed by the replica gc queue.
weird spaces
pkg/storage/store_split.go, line 51 at r2 (raw file):
... if it hasn't been removed in the meantime (handled below).
pkg/storage/store_split.go, line 53 at r2 (raw file):
// acquired) in Replica.acquireSplitLock. It must be present here.
rightRepl, err := r.store.GetReplica(split.RightDesc.RangeID)
if err != nil && !roachpb.IsRangeNotFoundError(err) {
I'd rewrite this for clarity as
rightRepl, err := r.store.GetReplica(split.RightDesc.RangeID)
if roachpb.IsRangeNotFoundError(err) {
	// The right hand side we were planning to populate has already
	// been removed. We handle this below.
	rightRepl, err = nil, nil
} else if err != nil {
	log.Fatalf(ctx, "failed to get RHS replica: %v", err)
}
pkg/storage/store_split.go, line 62 at r2 (raw file):
// We're in the rare case where we know that the RHS has been removed
// and re-added with a higher replica ID (and then maybe removed again).
// We know we've never processed a snapshot for the right range because up
Worth an assertion (if rightRng isn't nil).
pkg/storage/store_split.go, line 63 at r2 (raw file):
// and re-added with a higher replica ID (and then maybe removed again).
// We know we've never processed a snapshot for the right range because up
// to this point it would overlap with the left and ranges cannot move
spaces
pkg/storage/store_split.go, line 66 at r2 (raw file):
While you're here, rework this comment - it's unclear what it's about until one hits this paragraph. Something like
// We're in the rare case where we know that the RHS has been removed
// and re-added with a higher replica ID (and then maybe removed again).
//
// To apply the split, we need to "throw away" the data that would belong to
// the RHS, i.e. we clear the user data the RHS would have inherited from the
// LHS due to the split and additionally clear all of the range ID local state
// that the split trigger writes into the RHS.
//
// We know we've never processed a snapshot for the right range because the
// LHS prevents any incoming snapshots until the split has executed (i.e. now).
// It is important to preserve the HardState because we might however have
// already voted at a higher term. In general this shouldn't happen because
// we add learners and then promote them only after we snapshot but we're
// going to be extra careful in case future versions of cockroach somehow
// promote replicas without ensuring that a snapshot has been received.
//
// Rather than specifically deleting around the data we want to preserve
// we read the HardState to preserve it, clear everything and write back
// the HardState and tombstone. Note that we only do this if rightRepl
// exists; if it doesn't, there's no Raft state to massage (when rightRepl
// was removed, a tombstone was written instead).
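The preserve-clear-rewrite approach described in this comment can be sketched against a toy key-value map. Everything here (the `engine` type, the key layout, the function name) is invented for illustration and is not the storage engine's actual schema:

```go
package main

import (
	"fmt"
	"strings"
)

// engine is a toy stand-in for the on-disk key-value store.
type engine map[string]string

// clearRHSPreservingHardState models the comment's strategy: rather than
// deleting carefully around the data we want to keep, read the HardState,
// clear everything under the RHS prefix, then write the HardState and a
// tombstone back.
func clearRHSPreservingHardState(eng engine, prefix, tombstone string) {
	hardState, ok := eng[prefix+"/hardstate"]
	// Clear all state under the RHS prefix (user data inherited from the
	// LHS plus range-ID-local keys the split trigger would have written).
	for k := range eng {
		if strings.HasPrefix(k, prefix) {
			delete(eng, k)
		}
	}
	// Write back the HardState (it may record a vote at a higher term)
	// and lay down a tombstone.
	if ok {
		eng[prefix+"/hardstate"] = hardState
	}
	eng[prefix+"/tombstone"] = tombstone
}

func main() {
	eng := engine{
		"r2/hardstate": "term=7,vote=n3",
		"r2/raftlog/1": "entry",
		"r2/userdata":  "stale",
	}
	clearRHSPreservingHardState(eng, "r2", "nextReplicaID=5")
	fmt.Println(eng["r2/hardstate"], eng["r2/tombstone"], len(eng))
	// → term=7,vote=n3 nextReplicaID=5 2
}
```

The design choice mirrors the comment: blanket-clearing and rewriting the small amount of preserved state is simpler and safer than enumerating every key range to delete around.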
pkg/storage/store_split.go, line 111 at r2 (raw file):
// split with the Store. Requires that Replica.raftMu is held.
//
// TODO(tschottdorf): Want to merge this with SplitRange, but some legacy
Remove this TODO.
pkg/storage/store_split.go, line 116 at r2 (raw file):
	ctx context.Context, deltaMS enginepb.MVCCStats, split *roachpb.SplitTrigger, r *Replica,
) {
	rightRngOrNil := prepareRightReplicaForSplit(ctx, split, r)
Mention when it's nil.
pkg/storage/store_split.go, line 163 at r2 (raw file):
// If the RHS replica at the point of the split was known to be removed
// during the application of the split then we may not find it here. That's
// fine, carry on.
// See also:
_, _ = r.acquireSplitLock, r.splitPreApply
Force-pushed eb0e8a3 to 0dd71ca
ajwerner
left a comment
TFTR!
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @tbg)
pkg/storage/replica_command.go, line 1878 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This was just discussed on #44247 (comment)
TLDR: when you refuse the learner, you've already added it to the desc, so we need to be careful. We don't want to roll back replication changes en masse which is what would happen here.
Ack.
pkg/storage/replica_init.go, line 278 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
missing some words
I agree that it would be nice to be more coherent here. Not necessarily for this PR, looks like a decent amount of individual cleanup is needed.
Done.
pkg/storage/replica_raft.go, line 1538 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Comment stale.
Done.
pkg/storage/split_queue_test.go, line 77 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Shadowing a built-in? I usually go with cpy in that case.
Done. This was ancient code.
pkg/storage/store.go, line 1342 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
weird spaces
Done.
pkg/storage/store_split.go, line 51 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
... if it hasn't been removed in the meantime (handled below).
Done.
pkg/storage/store_split.go, line 53 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I'd rewrite this for clarity as
rightRepl, err := r.store.GetReplica(split.RightDesc.RangeID)
if roachpb.IsRangeNotFoundError(err) {
	// The right hand side we were planning to populate has already
	// been removed. We handle this below.
	rightRepl, err = nil, nil
} else if err != nil {
	log.Fatalf(ctx, "failed to get RHS replica: %v", err)
}
Done.
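The pattern suggested above, treating one specific "not found" error as an expected absence and fataling on anything else, can be sketched with the standard library's errors package. The names (`getReplica`, `lookupRHS`, `errRangeNotFound`) are illustrative stand-ins, not CockroachDB's API:

```go
package main

import (
	"errors"
	"fmt"
)

// errRangeNotFound stands in for roachpb.RangeNotFoundError; the real
// code tests for it with roachpb.IsRangeNotFoundError.
var errRangeNotFound = errors.New("range not found")

// getReplica is a toy stand-in for store.GetReplica.
func getReplica(id int, replicas map[int]string) (string, error) {
	r, ok := replicas[id]
	if !ok {
		return "", errRangeNotFound
	}
	return r, nil
}

// lookupRHS treats "not found" as an expected outcome and anything else
// as a fatal invariant violation.
func lookupRHS(id int, replicas map[int]string) (string, bool) {
	repl, err := getReplica(id, replicas)
	if errors.Is(err, errRangeNotFound) {
		// The RHS we were planning to populate has already been removed;
		// the caller handles the nil case.
		return "", false
	} else if err != nil {
		// Unexpected error: the real code calls log.Fatalf here.
		panic(fmt.Sprintf("failed to get RHS replica: %v", err))
	}
	return repl, true
}

func main() {
	replicas := map[int]string{2: "r2"}
	if _, ok := lookupRHS(1, replicas); !ok {
		fmt.Println("rhs already removed") // expected-absence path
	}
	if repl, ok := lookupRHS(2, replicas); ok {
		fmt.Println("found", repl)
	}
}
```

Handling the sentinel first and fataling on the remainder keeps the happy path unindented and makes the "expected absence" case impossible to miss, which is the clarity win of the suggested rewrite.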
pkg/storage/store_split.go, line 62 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Worth an assertion (if rightRng isn't nil).
Done.
pkg/storage/store_split.go, line 63 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
spaces
Done.
pkg/storage/store_split.go, line 66 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
While you're here, rework this comment - it's unclear what it's about until one hits this paragraph. Something like
// We're in the rare case where we know that the RHS has been removed
// and re-added with a higher replica ID (and then maybe removed again).
//
// To apply the split, we need to "throw away" the data that would belong to
// the RHS, i.e. we clear the user data the RHS would have inherited from the
// LHS due to the split and additionally clear all of the range ID local state
// that the split trigger writes into the RHS.
//
// We know we've never processed a snapshot for the right range because the
// LHS prevents any incoming snapshots until the split has executed (i.e. now).
// It is important to preserve the HardState because we might however have
// already voted at a higher term. In general this shouldn't happen because
// we add learners and then promote them only after we snapshot but we're
// going to be extra careful in case future versions of cockroach somehow
// promote replicas without ensuring that a snapshot has been received.
//
// Rather than specifically deleting around the data we want to preserve
// we read the HardState to preserve it, clear everything and write back
// the HardState and tombstone. Note that we only do this if rightRepl
// exists; if it doesn't, there's no Raft state to massage (when rightRepl
// was removed, a tombstone was written instead).
Done.
pkg/storage/store_split.go, line 111 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Remove this TODO.
Done.
pkg/storage/store_split.go, line 116 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Mention when it's nil.
Done.
pkg/storage/store_split.go, line 163 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
// See also:
_, _ = r.acquireSplitLock, r.splitPreApply
Done.
Force-pushed 0dd71ca to 454296b
tbg
left a comment
Reviewed 11 of 11 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/storage/store_split.go, line 91 at r3 (raw file):
// snapshot to initialize this newer RHS.
if rightRepl.IsInitialized() {
	log.Fatalf(ctx, "cannot delete ")
not that we're going to ever see this fire 🤞 but something's missing here
This commit adds a migration to remove any remaining on-disk pre-emptive
snapshots and then excises the concept from the codebase. With this commit
we gain the invariant that a Replica always has a non-zero replicaID that
never changes.

There's some cruft in replica_init that I'd love to clean up but don't
have a clear vision of how.

Release note: None
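A minimal sketch of what the commit's invariant (a Replica always has a non-zero replica ID that never changes) buys at replica-creation time. The types, the tombstone field, and the checks are illustrative assumptions, not the PR's actual code:

```go
package main

import (
	"errors"
	"fmt"
)

type replica struct{ id int }

var errStaleReplicaID = errors.New("replica ID below tombstone")

// newReplica refuses ID 0 outright (formerly the preemptive-snapshot
// marker) and refuses any ID below the tombstone's next replica ID,
// which guarantees a removed-and-re-added replica always comes back
// with a strictly higher ID.
func newReplica(id, tombstoneNextID int) (*replica, error) {
	if id == 0 {
		// With preemptive snapshots gone this can be a fatal error.
		panic("replicaID 0 is no longer possible")
	}
	if id < tombstoneNextID {
		return nil, errStaleReplicaID
	}
	return &replica{id: id}, nil
}

func main() {
	if _, err := newReplica(3, 5); err != nil {
		fmt.Println("rejected:", err) // stale incoming message
	}
	r, _ := newReplica(5, 5)
	fmt.Println("created replica", r.id)
}
```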
Force-pushed 454296b to b2850e1
ajwerner
left a comment
TFTR!
bors r=tbg
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @tbg)
pkg/storage/store_split.go, line 91 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
not that we're going to ever see this fire 🤞 but something's missing here
Done.
44564: colexec: clean up memory infra on panic in NewColOperator r=yuzefovich a=yuzefovich

It is possible that an unexpected error occurs during the NewColOperator call after we have created some memory monitoring infrastructure. In order to correctly clean up the infra in such a scenario we will now have a separate panic catcher that closes the infra within NewColOperator. We will also now clean up the infra when an error occurs.

Release note: None

44615: storage: remove preemptive snapshots r=tbg a=ajwerner

This commit adds a migration to remove any remaining on-disk pre-emptive snapshots and then excises the concept from the codebase.

There's some cruft in replica_init that I'd love to clean up but don't have a clear vision of how.

Release note: None

44660: colexec: reject CASE operator with Bytes output type r=yuzefovich a=yuzefovich

Currently, there is a contradiction between the way the CASE operator works (which populates its output in arbitrary order) and the flat bytes implementation of the Bytes type (which prohibits sets in arbitrary order), so we reject such a scenario to fall back to the row-by-row engine.

Fixes: #44624.

Release note: None

44687: cli: add system.namespace_deprecated to debug zip r=knz a=otan

As we are migrating system.namespace from system.namespace_deprecated in 20.2, it may be useful to have a zip of this old table, in case it comes in handy for debugging in the future.

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Co-authored-by: Oliver Tan <otan@cockroachlabs.com>
Build succeeded
We recently saw an abandoned "leaked" uninitialized replica on a 19.2 node. The unfortunate truth is that in that version we aren't sure of its replica ID and thus can't remove it. In 20.1 and later we can remove uninitialized replicas (see cockroachdb#44615).

Much of the time, uninitialized replicas are removed either because the range is re-added to the store at a higher replica ID or because the node receives a ReplicaTooOldError which leads to the removal. The replicaGCQueue is the obvious tool to remove "leaked" uninitialized replicas. This commit has to do some work to allow that queue the opportunity to interact with uninitialized replicas.

There are at least two cases where such replicas can certainly come to exist. The first is when a split is rapidly followed by a removal of both the LHS and RHS but the LHS never processes the split because it receives a ReplicaTooOldError. The second is when a snapshot to add a learner fails but the learner never discovers that it has been removed.

Release note (bug fix): Garbage collect abandoned, uninitialized replicas.
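The GC decision described above, removing an uninitialized replica whose store no longer appears in the authoritative range descriptor, can be sketched as follows. The types and function are illustrative, not the replicaGCQueue's real signature:

```go
package main

import "fmt"

// rangeDesc is a toy stand-in for the authoritative RangeDescriptor,
// reduced to the set of store IDs that hold a replica of the range.
type rangeDesc struct{ storeIDs []int }

// shouldGC reports whether the local (possibly uninitialized) replica on
// storeID is removable: if the store is absent from the descriptor, the
// replica has been abandoned and can be garbage collected.
func shouldGC(desc rangeDesc, storeID int) bool {
	for _, s := range desc.storeIDs {
		if s == storeID {
			return false
		}
	}
	return true
}

func main() {
	desc := rangeDesc{storeIDs: []int{1, 2, 4}}
	fmt.Println(shouldGC(desc, 3)) // store 3 was removed from the range
	fmt.Println(shouldGC(desc, 2)) // store 2 still holds a replica
}
```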
This commit adds a migration to remove any remaining on-disk pre-emptive
snapshots and then excises the concept from the codebase.
There's some cruft in replica_init that I'd love to clean up but don't
have a clear vision of how.
Release note: None