loqrecovery: add replica info collection to admin server #93157

craig[bot] merged 1 commit into cockroachdb:master
Conversation
// but in practice info collected from remote cluster will contain info per
// node. In case of offline collection, replicas will belong to all stores
// provided to the command regardless of owning node.
message NodeReplicaInfo {
We should probably add in a node_id field here now as well.
That would not work very well with offline collection, where we are collecting all stores. There's no restriction on it being a single node, as the comment says. We were advertising this as a way to mount stores from multiple containers into a single host and then run collection. I'm not comfortable changing old behaviours in small ways straight after introducing them.
I see. I suppose we could still group by node ID in those cases (each store must necessarily belong to one node), or otherwise explicitly mark the data as belonging to multiple nodes (e.g. with node_id = 0). But I'm also fine with deferring that until we actually need it -- for now, I think the only practical consequence is potential confusion/bugs for developers touching this code.
pkg/kv/kvserver/loqrecovery/utils.go
Outdated
// version. It checks major and minor versions only. This is simple check for
// use with transient recovery info files only to prevent incorrect files being
// used in error.
func CanUseFileVersion(v roachpb.Version) error {
We're checking the file version here, but only for the CLI command used to coordinate the change. Should it be possible to use an N-1 CLI version to recover an N version cluster or a mixed-version cluster? There's currently nothing preventing that, because RPC calls will allow mixing N and N-1 versions. If we allow that (which we currently do) then we have to provide N-1 backwards compatibility with all of the RPC APIs going forwards -- are we prepared to guarantee that? That also implies that we have to allow executing an N-1 plan on a node at version N. In either case, we must check the plan version during plan application.
My idea is exactly that. For the planning phase, we don't want to mix versions of replica infos; we should only use ones that were created by the same version. When applying a plan, we should be able to use a plan from the previous version for the case of mixed-version clusters. So until you have finalized your upgrade, you should use the older version of the tools.
If a node at version N has to remain compatible with a plan generated by a binary at version N-1, why can't the planner at version N remain compatible with data collected by a node at version N-1? Is that harder somehow? If we support running a binary at version N-1 against an unfinalized cluster at version N, then the cluster already has to remain compatible with both a collector and a planner at version N-1.
We discussed this a bit offline, and concluded that we want to support mixed-version clusters throughout the loss of quorum recovery process in the same way that we support mixed-version clusters elsewhere. However, this needs a bit of care with version checks and detection. We'll remove the mixed-version handling from this PR for now, and address mixed-version concerns throughout the loss of quorum process in a separate PR.
knz
left a comment
I reviewed just the server and cli changes.
Reviewed 2 of 22 files at r5, 7 of 18 files at r6.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @benbardin, @erikgrinaker, and @miretskiy)
pkg/cli/debug_recover_loss_of_quorum.go line 218 at r6 (raw file):
func runDebugDeadReplicaCollect(cmd *cobra.Command, args []string) error {
	// We need cancellable context here to obtain grpc client connection.
	// getAdminClient will refuse otherwise.
I think it's correct to need a cancellable context, but I don't understand this comment. What will getAdminClient refuse?
pkg/cli/debug_recover_loss_of_quorum.go line 236 at r6 (raw file):
if err != nil {
	return errors.Wrap(err, "failed to retrieve replica info from cluster. "+
remove period at end. As you've written it the result of this would be failed to retrieve replica info from cluster.: rest of the error
Try this instead:
errors.WithHint(errors.Wrap(err, "failed to retrieve replica info from cluster"),
	"Check the cluster's health and retry the operation.")

pkg/cli/debug_recover_loss_of_quorum.go line 453 at r6 (raw file):
return loqrecoverypb.ClusterReplicaInfo{}, errors.Wrapf(err,
	"failed to unmarshal replica info from file %q, ensure that replica info file is "+
		"generated with the same binary version and file is not corrupted",
Use errors.WithHint.
pkg/cli/debug_recover_loss_of_quorum.go line 458 at r6 (raw file):
if err = loqrecovery.CanUseFileVersion(nodeReplicas.CockroachVersion); err != nil {
	return loqrecoverypb.ClusterReplicaInfo{}, errors.Wrapf(err, "incorrect data version in file %s", filename)
%q for consistency.
pkg/server/admin.go line 3281 at r6 (raw file):
	visitor func(nodeID roachpb.NodeID, client serverpb.AdminClient) error,
) error {
nit: remove empty line.
pkg/server/admin.go line 3298 at r6 (raw file):
	continue
}
err = visitor(nodeID, client)
Maybe log the error if non-nil (even if retried). This will help us troubleshoot retry loops that don't make progress.
erikgrinaker
left a comment
Approving assuming we strip out the mixed-version stuff for now.
pkg/server/admin.go
Outdated
for r := retry.StartWithCtx(ctx, retryOpts); r.Next(); {
	log.Infof(ctx, "visiting node n%d, attempt %d", nodeID, r.CurrentAttempt())
	var client serverpb.AdminClient
	client, err = s.dialNode(ctx, nodeID)
This dependency on dialNode() is a bit iffy because it calls through to kvFanoutClient, which in general uses KV to retrieve node information. While dialNode() doesn't currently incur any KV calls, it very well could in the future.
I think we should be explicit about only depending on gossip here, since that's critical for this to continue functioning in broken clusters. We can pull VisitAvailableNodesWithRetry out into a standalone helper function under loqrecovery that takes a gossip client as a parameter, and reimplement dialNode() there (which is trivial). That way we're certain that this never relies on KV. It also removes the reverse dependency between systemAdminServer and loqrecovery.Server. Alternatively we could implement a gossipFanoutClient and pass that in, but that might be a bit overkill.
Should we ditch all the extra machinery that does heartbeats, connection sharing, and everything else, and just stick to a pure admin connection to do the fan-out?
We can pass in the RPC context too, if we need it. Seems reasonable to use the same machinery that we normally use.
Alternatively, make it a method on loqrecovery.Server and pass in gossip and the RPC context during construction.
Yeah, I prefer the last idea of passing at construction, as all the fields are static. But I think it makes more sense to make the function a closure so that it can always be replaced for tests. Otherwise it becomes more and more entangled with other components, which is tough to break up for tests.
Could just use a testing knob as well, and dispatch to that at the start of the call when set.
This commit adds replica info and range descriptor collection from a partially available cluster that lost quorum on some ranges. Collection is done using AdminServer calls. The cluster-wide call performs fan-out using node info from gossip. Local replica info collection on nodes is done by scanning storages.

Release note (cli change): The file format used for transient loss of quorum recovery files has changed. Replica info files generated by earlier versions cannot be used with current and future versions.
bors r=erikgrinaker,knz

Build failed (retrying...)

Build succeeded
94239: loqrecovery: use captured meta range content for LOQ plans r=erikgrinaker a=aliher1911

Note: only last commit belongs to this PR. Will update description once #93157 lands.

Previously, the loss of quorum recovery planner was using local replica info collected from all nodes to find the source of truth for replicas that lost quorum. With the online approach, local info snapshots don't have atomicity. This could cause the planner to fail if available replicas are caught in different states on different nodes.

This commit adds an alternative planning approach when range metadata is available. Instead of fixing individual replicas that can't make progress, it finds ranges that can't make progress from metadata using descriptors, and updates their replicas to recover from loss of quorum.

This commit also adds a replica collection stage as a part of the make-plan command itself. To invoke collection from a cluster instead of using files, one needs to provide --host and other standard cluster connection related flags (--cert-dir, --insecure etc.) as appropriate.

Example command output for a local cluster with 3 out of 5 nodes surviving looks like:

```
~/tmp ❯❯❯ cockroach debug recover make-plan --insecure --host 127.0.0.1:26257 >recover-plan.json
Nodes scanned:           3
Total replicas analyzed: 228
Ranges without quorum:   15
Discarded live replicas: 0

Proposed changes:
range r4:/System/tsd updating replica (n2,s2):3 to (n2,s2):15. Discarding available replicas: [], discarding dead replicas: [(n5,s5):4,(n4,s4):2].
range r80:/Table/106/1 updating replica (n1,s1):1 to (n1,s1):14. Discarding available replicas: [], discarding dead replicas: [(n5,s5):3,(n4,s4):2].
range r87:/Table/106/1/"paris"/"\xcc\xcc\xcc\xcc\xcc\xcc@\x00\x80\x00\x00\x00\x00\x00\x00(" updating replica (n1,s1):1 to (n1,s1):14. Discarding available replicas: [], discarding dead replicas: [(n5,s5):3,(n4,s4):2].
range r88:/Table/106/1/"seattle"/"ffffffH\x00\x80\x00\x00\x00\x00\x00\x00\x14" updating replica (n3,s3):3 to (n3,s3):15. Discarding available replicas: [], discarding dead replicas: [(n5,s5):4,(n4,s4):2].
range r105:/Table/106/1/"washington dc"/"L\xcc\xcc\xcc\xcc\xccL\x00\x80\x00\x00\x00\x00\x00\x00\x0f" updating replica (n3,s3):3 to (n3,s3):14. Discarding available replicas: [], discarding dead replicas: [(n5,s5):1,(n4,s4):2].
range r98:/Table/107/1/"boston"/"333333D\x00\x80\x00\x00\x00\x00\x00\x00\x03" updating replica (n2,s2):3 to (n2,s2):15. Discarding available replicas: [], discarding dead replicas: [(n5,s5):4,(n4,s4):2].
range r95:/Table/107/1/"seattle"/"ffffffH\x00\x80\x00\x00\x00\x00\x00\x00\x06" updating replica (n3,s3):2 to (n3,s3):15. Discarding available replicas: [], discarding dead replicas: [(n4,s4):4,(n5,s5):3].
range r125:/Table/107/1/"washington dc"/"DDDDDDD\x00\x80\x00\x00\x00\x00\x00\x00\x04" updating replica (n3,s3):2 to (n3,s3):14. Discarding available replicas: [], discarding dead replicas: [(n4,s4):1,(n5,s5):3].
range r115:/Table/108/1/"boston"/"8Q\xeb\x85\x1e\xb8B\x00\x80\x00\x00\x00\x00\x00\x00n" updating replica (n2,s2):3 to (n2,s2):15. Discarding available replicas: [], discarding dead replicas: [(n5,s5):4,(n4,s4):2].
range r104:/Table/108/1/"new york"/"\x1c(\xf5\u008f\\I\x00\x80\x00\x00\x00\x00\x00\x007" updating replica (n2,s2):2 to (n2,s2):15. Discarding available replicas: [], discarding dead replicas: [(n5,s5):4,(n4,s4):3].
range r102:/Table/108/1/"seattle"/"p\xa3\xd7\n=pD\x00\x80\x00\x00\x00\x00\x00\x00\xdc" updating replica (n3,s3):2 to (n3,s3):15. Discarding available replicas: [], discarding dead replicas: [(n4,s4):4,(n5,s5):3].
range r126:/Table/108/1/"washington dc"/"Tz\xe1G\xae\x14L\x00\x80\x00\x00\x00\x00\x00\x00\xa5" updating replica (n3,s3):2 to (n3,s3):14. Discarding available replicas: [], discarding dead replicas: [(n4,s4):1,(n5,s5):3].
range r86:/Table/108/3 updating replica (n1,s1):1 to (n1,s1):14. Discarding available replicas: [], discarding dead replicas: [(n4,s4):3,(n5,s5):2].
range r59:/Table/109/1 updating replica (n2,s2):3 to (n2,s2):15. Discarding available replicas: [], discarding dead replicas: [(n5,s5):4,(n4,s4):2].
range r65:/Table/111/1 updating replica (n3,s3):3 to (n3,s3):15. Discarding available replicas: [], discarding dead replicas: [(n5,s5):4,(n4,s4):2].

Discovered dead nodes would be marked as decommissioned: n4, n5

Proceed with plan creation [y/N] y
Plan created.
To stage recovery application in half-online mode invoke:
'cockroach debug recover apply-plan --host=127.0.0.1:26257 --insecure=true <plan file>'
Alternatively distribute plan to below nodes and invoke 'debug recover apply-plan --store=<store-dir> <plan file>' on:
- node n2, store(s) s2
- node n1, store(s) s1
- node n3, store(s) s3
```

Release note: None

Fixes: #93038
Fixes: #93046

94948: changefeedccl: give notice when metrics_label set without child_metrics r=samiskin a=samiskin

Resolves #75682

Surfaces a notice of

```
server.child_metrics.enabled is set to false, metrics will only be published to the '<scope>' label when it is set to true
```

when the child_metrics setting isn't enabled during changefeed creation.

Release note (enterprise change): Changefeeds created/altered with a metrics_label set while server.child_metrics.enabled is false will now provide the user a notice upon creation.

95009: tree: fix panic when encoding tuple r=rafiss a=rafiss

fixes #95008

This adds a bounds check to avoid a panic.

Release note (bug fix): Fixed a crash that could happen when formatting a tuple with an unknown type.

95294: sql: make pg_description aware of builtin function descriptions r=rafiss,msirek a=knz

Epic: CRDB-23454
Fixes #95292. Needed for #88061.
First commit from #95289.

This also extends the completion rules to properly handle functions in multiple namespaces.

Release note (bug fix): `pg_catalog.pg_description` and `pg_catalog.obj_description()` are now able to retrieve the descriptive help for built-in functions.

95356: server: remove unused migrationExecutor r=ajwerner a=ajwerner

This is no longer referenced since #91627.

Epic: none
Release note: None

Co-authored-by: Oleg Afanasyev <oleg@cockroachlabs.com>
Co-authored-by: Shiranka Miskin <shiranka.miskin@gmail.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Andrew Werner <awerner32@gmail.com>
This commit adds replica info and range descriptor collection from a partially available cluster that lost quorum on some ranges. Collection is done using AdminServer calls. The cluster-wide call performs fan-out using node info from gossip. Local replica info collection on nodes is done by scanning storages.

Release note: The file format used for transient loss of quorum recovery files has changed. Replica info files generated by earlier versions cannot be used with current and future versions.
Fixes #93040
Note that recovery doesn't need the collect command to be "remote capable", but allowing this makes development and debugging simpler, as you can create a local snapshot from a cluster and subsequently verify your planning against it.
Touches #74135