storage: make subsumption atomic with merge commit application #28684
craig[bot] merged 1 commit into cockroachdb:master
Conversation
tbg
left a comment
Reviewed 10 of 10 files at r1, 8 of 8 files at r2.
Reviewable status:complete! 0 of 0 LGTMs obtained
pkg/storage/replica.go, line 809 at r2 (raw file):
// Suggest the cleared range to the compactor queue.
r.store.compactor.Suggest(ctx, storagebase.SuggestedCompaction{
This should happen later, after you've actually committed it, in postDestroyRaftMuLocked (or any later point in time).
pkg/storage/replica.go, line 872 at r2 (raw file):
}
ms := r.GetMVCCStats()
Would feel better if you grabbed this before actually destroying the replica. It's unclear what answer you're getting right now.
bdarnell
left a comment
s/subsumation/subsumption/
Reviewed 2 of 10 files at r1, 8 of 8 files at r2.
Reviewable status:complete! 0 of 0 LGTMs obtained
pkg/storage/store.go, line 2529 at r2 (raw file):
// RemoveOptions bundles boolean parameters for Store.RemoveReplica.
type RemoveOptions struct {
	DestroyReplica bool
I prefer the old name. The Replica object is effectively destroyed either way; the main effect of destroyRaftMuLocked is to delete the data.
benesch
left a comment
D'oh, thanks. I knew "subsumation" sounded funny.
Reviewable status:complete! 0 of 0 LGTMs obtained
pkg/storage/replica.go, line 809 at r2 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
This should happen later, after you've actually committed it, in postDestroyRaftMuLocked (or any later point in time).
Done.
pkg/storage/replica.go, line 872 at r2 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
Would feel better if you grabbed this before actually destroying the replica. It's unclear what answer you're getting right now.
Done.
pkg/storage/store.go, line 2529 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
I prefer the old name. The Replica object is effectively destroyed either way; the main effect of destroyRaftMuLocked is to delete the data.
Done.
During a merge, the subsumed range logically ceases to exist at the moment that the merge transaction commit applies. It's important that we remove the subsumed range from disk in the same batch as the merge commit to protect against an ill-timed crash. Tease apart Replica.destroyRaftMuLocked to make this possible.

Release note: None
bors r=bdarnell,tschottdorf
Build failed
Examples-ORMs flake. bors r=bdarnell,tschottdorf
28684: storage: make subsumption atomic with merge commit application r=bdarnell,tschottdorf a=benesch

~Please review the first commit separately in #28661.~

During a merge, the subsumed range logically ceases to exist at the moment that the merge transaction commit applies. It's important that we remove the subsumed range from disk in the same batch as the merge commit to protect against an ill-timed crash. Tease apart Replica.destroyRaftMuLocked to make this possible.

Release note: None

28707: server: don't drain when decommissioning r=nvanbenschoten a=tschottdorf

Prior to this commit, a server which entered decommissioning would automatically drain. This wasn't a great idea: pointing the decommissioning command at a set of nodes required for the cluster to work would brick that cluster, and recovery would be difficult (since the draining state is picked up by the nodes when they start).

Instead, decouple draining from decommissioning. Decommissioning simply tells the allocator to move data off that node; when the node is shut down cleanly (the operator's responsibility), it will drain. The resulting behavior when trying to decommission a too-large set of nodes is that the decommission command simply does not finish. For example, starting a three-node cluster and decommissioning all nodes will hang (though the nodes will be marked as decommissioning). Add three more nodes to the cluster and replicas will move over to the newly added nodes, and the decommissioning command will finish.

As a bonus, recommissioning a node no longer requires the target node to restart. As a second bonus, the decommissioning acceptance test now takes around 60% of the previous time (~45s, down from 70s+).

One remaining caveat is that users may forget that they attempted to decommission nodes. We need to check that we prominently alert in the UI when nodes are decommissioning. This isn't a new problem.

Fixes #27444.
Fixes #27025.

Release note (bug fix): decommissioning multiple nodes is now possible without posing a risk to cluster health. Recommissioning a node no longer requires a restart of the target nodes to take effect.

Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Build succeeded