loqrecovery: use captured meta range content for LOQ plans#94239
loqrecovery: use captured meta range content for LOQ plans#94239craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
973507f to
70e07e9
Compare
809abfb to
5305b54
Compare
c727406 to
36bace0
Compare
36bace0 to
8b8eb85
Compare
pkg/kv/kvserver/loqrecovery/utils.go
Outdated
There was a problem hiding this comment.
We allow use of Go generics now (#93735). This seems like a good opportunity to try it out with some generic map intersect/diff functions (or just use a library which provides it). Doesn't have to happen in this PR though, or at all.
pkg/cli/debug.go
Outdated
There was a problem hiding this comment.
Should we just get rid of dead-store-ids, to avoid the complexity of supporting both?
There was a problem hiding this comment.
I think we should get rid of dead store ids in 23.2. Just to avoid any confusion on mixed cluster where old option which was the primary means disappears after upgrade. It should be less of an issue with 23.1 to 23.2 where we don't expect people to use dead-store-ids anymore.
8b8eb85 to
0618f19
Compare
9b527a8 to
f21dbac
Compare
Previously loss of quorum recovery planner was using local replica info collected from all nodes to find source of truth for replicas that lost quorum. With online approach local info snapshots don't have atomicity. This could cause planner to fail if available replicas are caught in different states on different nodes. This commit adds alternative planning approach when range metadata is available. Instead of fixing individual replicas that can't make progress it finds ranges that can't make progress from metadata using descriptors and updates their replicas to recover from loss of quorum. This commit also adds replica collection stage as a part of make-plan command itself. To invoke collection from a cluster instead of using files one needs to provide --host and other standard cluster connection related flags (--cert-dir, --insecure etc.) as appropriate. Release note: None
f21dbac to
b5d5fc3
Compare
|
bors r=erikgrinaker |
|
Build failed (retrying...): |
|
Build succeeded: |
Note: only last commit belongs to this PR. Will update description once #93157 lands.
Previously loss of quorum recovery planner was using local replica info collected from all nodes to find source of truth for replicas that lost quorum.
With online approach local info snapshots don't have atomicity. This could cause planner to fail if available replicas are caught in different states on different nodes.
This commit adds alternative planning approach when range metadata is available. Instead of fixing individual replicas that can't make progress it finds ranges that can't make progress from metadata using descriptors and updates their replicas to recover from loss of quorum.
This commit also adds replica collection stage as a part of make-plan command itself. To invoke collection from a cluster instead of using files one needs to provide --host and other standard cluster connection related flags (--cert-dir, --insecure etc.) as appropriate.
Example command output for a local cluster with 3 out of 5 nodes surrvivng looks like:
Release note: None
Fixes: #93038
Fixes: #93046