loqrecovery: add replica info collection to admin server#93157

Merged

craig[bot] merged 1 commit into cockroachdb:master from aliher1911:loq_05online_collect_replicas
Jan 17, 2023

Conversation

@aliher1911
Contributor

@aliher1911 aliher1911 commented Dec 6, 2022

This commit adds replica info and range descriptor collection from a partially available cluster that has lost quorum on some ranges. Collection is done using AdminServer calls. The cluster-wide call performs a fan-out using node info from gossip. Local replica info collection on nodes is done by scanning storages.

Release note: The file format used for transient loss of quorum recovery files has changed. Replica info files generated by earlier versions cannot be used with current and future versions.

Fixes #93040

Note that recovery doesn't need the collect command to be "remote capable", but allowing this makes development and debugging simpler, as you can create a local snapshot from a cluster and subsequently verify your planning against it.

Touches #74135
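As a rough illustration of the flow described above (this is a hedged sketch, not the actual CockroachDB API; `nodeID`, `replicaInfo`, and the helper names are all hypothetical): node IDs come from a gossip snapshot, each node scans its local storages, and the results are aggregated into a cluster-wide snapshot.

```go
package main

import "fmt"

// nodeID and replicaInfo are simplified stand-ins for the real types.
type nodeID int

type replicaInfo struct {
	Node    nodeID
	RangeID int
}

// collectLocalReplicaInfo stands in for the per-node storage scan that the
// AdminServer endpoint performs on each node.
func collectLocalReplicaInfo(id nodeID) []replicaInfo {
	// A real implementation would iterate over the node's stores.
	return []replicaInfo{{Node: id, RangeID: 1}}
}

// collectClusterReplicaInfo fans out to every node known to gossip and
// aggregates the per-node results into a cluster-wide snapshot.
func collectClusterReplicaInfo(gossipNodes []nodeID) []replicaInfo {
	var all []replicaInfo
	for _, id := range gossipNodes {
		all = append(all, collectLocalReplicaInfo(id)...)
	}
	return all
}

func main() {
	infos := collectClusterReplicaInfo([]nodeID{1, 2, 3})
	fmt.Printf("collected %d replica infos\n", len(infos))
}
```

In the real code the fan-out is an RPC to each node's admin server; the sketch only captures the aggregation shape.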

@cockroach-teamcity
Member

This change is Reviewable

@aliher1911 aliher1911 force-pushed the loq_05online_collect_replicas branch 2 times, most recently from bc3f976 to c321b6e Compare December 6, 2022 20:54
@aliher1911 aliher1911 force-pushed the loq_05online_collect_replicas branch 3 times, most recently from 4a20157 to de43c15 Compare December 20, 2022 14:10
@aliher1911 aliher1911 self-assigned this Dec 20, 2022
@aliher1911 aliher1911 force-pushed the loq_05online_collect_replicas branch 9 times, most recently from f1990db to a8c4c54 Compare December 23, 2022 15:26
@aliher1911 aliher1911 force-pushed the loq_05online_collect_replicas branch 5 times, most recently from 3eda780 to cbc19c6 Compare January 3, 2023 19:18
@aliher1911 aliher1911 force-pushed the loq_05online_collect_replicas branch 4 times, most recently from d1c0707 to 3284138 Compare January 5, 2023 16:00
@aliher1911 aliher1911 marked this pull request as ready for review January 5, 2023 17:31
@aliher1911 aliher1911 requested review from a team as code owners January 5, 2023 17:31
@aliher1911 aliher1911 requested a review from a team January 5, 2023 17:31
@aliher1911 aliher1911 requested review from benbardin and miretskiy and removed request for a team January 11, 2023 15:55
@aliher1911 aliher1911 force-pushed the loq_05online_collect_replicas branch from 0168d11 to c5781fd Compare January 11, 2023 15:59
@dhartunian dhartunian removed the request for review from a team January 11, 2023 17:28
@aliher1911 aliher1911 force-pushed the loq_05online_collect_replicas branch from c5781fd to 96ab655 Compare January 11, 2023 18:09
@aliher1911 aliher1911 force-pushed the loq_05online_collect_replicas branch 3 times, most recently from dae4d94 to 94bfa12 Compare January 12, 2023 11:39
// but in practice info collected from remote cluster will contain info per
// node. In case of offline collection, replicas will belong to all stores
// provided to the command regardless of owning node.
message NodeReplicaInfo {
Contributor

We should probably add a node_id field here now as well.

Contributor Author

That would not work very well with offline collection, where we are collecting all stores. There are no restrictions on it being a single node, as the comment says. We were advertising this as a way to mount stores from multiple containers into a single host and then running collection. I'm not comfortable changing old behaviours in small ways straight after introducing them.

Contributor

I see. I suppose we could still group by node ID in those cases (each store must necessarily belong to one node), or otherwise explicitly mark the data as belonging to multiple nodes (e.g. with node_id = 0). But I'm also fine with deferring that until we actually need it -- for now, I think the only practical consequence is potential confusion/bugs for developers touching this code.
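The grouping suggested here could look roughly like the following (hypothetical types for illustration only; `NodeID == 0` is the proposed sentinel for offline-collected data spanning multiple nodes):

```go
package main

import "fmt"

// replica is a simplified stand-in for collected replica info.
type replica struct {
	StoreID int
	NodeID  int // 0 marks offline-collected data spanning multiple nodes
}

// groupByNode buckets replicas by node ID; offline-collected replicas
// (NodeID == 0) all land in one shared bucket.
func groupByNode(replicas []replica) map[int][]replica {
	byNode := make(map[int][]replica)
	for _, r := range replicas {
		byNode[r.NodeID] = append(byNode[r.NodeID], r)
	}
	return byNode
}

func main() {
	groups := groupByNode([]replica{
		{StoreID: 1, NodeID: 1},
		{StoreID: 2, NodeID: 1},
		{StoreID: 3, NodeID: 0}, // offline collection, owning node unknown
	})
	fmt.Printf("%d buckets\n", len(groups))
}
```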

// version. It checks major and minor versions only. This is simple check for
// use with transient recovery info files only to prevent incorrect files being
// used in error.
func CanUseFileVersion(v roachpb.Version) error {
Contributor

We're checking the file version here, but only for the CLI command used to coordinate the change. Should it be possible to use an N-1 CLI version to recover an N version cluster or a mixed-version cluster? There's currently nothing preventing that, because RPC calls will allow mixing N and N-1 versions. If we allow that (which we currently do) then we have to provide N-1 backwards compatibility with all of the RPC APIs going forwards -- are we prepared to guarantee that? That also implies that we have to allow executing an N-1 plan on a node at version N. In either case, we must check the plan version during plan application.

Contributor Author

My idea is exactly that. For the planning phase, we don't want to mix versions of replica infos, and should only use ones that were created by the same version. When applying a plan, we should be able to use a plan from the previous version for the case of mixed-version clusters. So until you've finalized your upgrade, you should use the older version of the tools.

Contributor

If a node at version N has to remain compatible with a plan generated by a binary at version N-1, why can't the planner at version N remain compatible with data collected by a node at version N-1? Is that harder somehow? If we support running a binary at version N-1 against an unfinalized cluster at version N, then the cluster already has to remain compatible with both a collector and a planner at version N-1.

We discussed this a bit offline, and concluded that we want to support mixed-version clusters throughout the loss of quorum recovery process in the same way that we support mixed-version clusters elsewhere. However, this needs a bit of care with version checks and detection. We'll remove the mixed-version handling from this PR for now, and address mixed-version concerns throughout the loss of quorum process in a separate PR.
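The major/minor-only comparison that CanUseFileVersion performs might be sketched like this (a simplified stand-in version type; the real code operates on roachpb.Version):

```go
package main

import "fmt"

// version is a simplified stand-in for roachpb.Version.
type version struct {
	Major, Minor int32
}

// canUseFileVersion accepts a replica info file only when its major and
// minor versions match the binary's, mirroring the check discussed above.
// Patch-level differences are deliberately ignored.
func canUseFileVersion(binary, file version) error {
	if binary.Major != file.Major || binary.Minor != file.Minor {
		return fmt.Errorf(
			"replica info file version %d.%d is incompatible with binary version %d.%d",
			file.Major, file.Minor, binary.Major, binary.Minor)
	}
	return nil
}

func main() {
	err := canUseFileVersion(version{Major: 23, Minor: 1}, version{Major: 22, Minor: 2})
	fmt.Println(err != nil)
}
```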

@aliher1911 aliher1911 force-pushed the loq_05online_collect_replicas branch 2 times, most recently from f3c4b31 to 77fbf8f Compare January 13, 2023 11:48
Contributor

@knz knz left a comment

I reviewed just the server and cli changes.

Reviewed 2 of 22 files at r5, 7 of 18 files at r6.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @benbardin, @erikgrinaker, and @miretskiy)


pkg/cli/debug_recover_loss_of_quorum.go line 218 at r6 (raw file):

func runDebugDeadReplicaCollect(cmd *cobra.Command, args []string) error {
	// We need cancellable context here to obtain grpc client connection.
	// getAdminClient will refuse otherwise.

I think it's correct to need a cancellable context, but I don't understand this comment. What will getAdminClient refuse?


pkg/cli/debug_recover_loss_of_quorum.go line 236 at r6 (raw file):

		if err != nil {
			return errors.Wrap(err,
				"failed to retrieve replica info from cluster. "+

Remove the period at the end. As you've written it, the result would be: `failed to retrieve replica info from cluster.: rest of the error`

Try this instead:

errors.WithHint(errors.Wrap(err, "failed to retrieve replica info from cluster"),
   "Check the cluster's health and retry the operation.")

pkg/cli/debug_recover_loss_of_quorum.go line 453 at r6 (raw file):

			return loqrecoverypb.ClusterReplicaInfo{}, errors.Wrapf(err,
				"failed to unmarshal replica info from file %q, ensure that replica info file is "+
					"generated with the same binary version and file is not corrupted",

Use errors.WithHint.


pkg/cli/debug_recover_loss_of_quorum.go line 458 at r6 (raw file):

		if err = loqrecovery.CanUseFileVersion(nodeReplicas.CockroachVersion); err != nil {
			return loqrecoverypb.ClusterReplicaInfo{}, errors.Wrapf(err,
				"incorrect data version in file %s", filename)

%q for consistency.


pkg/server/admin.go line 3281 at r6 (raw file):

	visitor func(nodeID roachpb.NodeID, client serverpb.AdminClient) error,
) error {

nit: remove empty line.


pkg/server/admin.go line 3298 at r6 (raw file):

				continue
			}
			err = visitor(nodeID, client)

Maybe log the error if non-nil (even if retried). This will help us troubleshoot retry loops that don't make progress.

Contributor

@erikgrinaker erikgrinaker left a comment

Approving assuming we strip out the mixed-version stuff for now.

for r := retry.StartWithCtx(ctx, retryOpts); r.Next(); {
log.Infof(ctx, "visiting node n%d, attempt %d", nodeID, r.CurrentAttempt())
var client serverpb.AdminClient
client, err = s.dialNode(ctx, nodeID)
Contributor

This dependency on dialNode() is a bit iffy because it calls through to kvFanoutClient, which in general uses KV to retrieve node information. While dialNode() doesn't currently incur any KV calls, it very well could in the future.

I think we should be explicit about only depending on gossip here, since that's critical for this to continue functioning in broken clusters. We can pull VisitAvailableNodesWithRetry out into a standalone helper function under loqrecovery that takes a gossip client as a parameter, and reimplement dialNode() there (which is trivial). That way we're certain that this never relies on KV. It also removes the reverse dependency between systemAdminServer and loqrecovery.Server. Alternatively we could implement a gossipFanoutClient and pass that in, but that might be a bit overkill.
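The gossip-only helper proposed here could be shaped roughly as follows (a hypothetical interface; the real version would open a gRPC connection and return an AdminClient rather than the address):

```go
package main

import "fmt"

// gossipClient is a stand-in for the gossip-backed address lookup. By
// depending only on this interface, the fan-out can never fall back to KV.
type gossipClient interface {
	NodeAddress(nodeID int) (string, error)
}

// staticGossip is a trivial in-memory implementation for illustration.
type staticGossip map[int]string

func (g staticGossip) NodeAddress(nodeID int) (string, error) {
	addr, ok := g[nodeID]
	if !ok {
		return "", fmt.Errorf("n%d not found in gossip", nodeID)
	}
	return addr, nil
}

// dialNode resolves a node's address purely from gossip. A real version
// would dial the address and return a client connection.
func dialNode(g gossipClient, nodeID int) (string, error) {
	return g.NodeAddress(nodeID)
}

func main() {
	g := staticGossip{1: "10.0.0.1:26257", 2: "10.0.0.2:26257"}
	addr, err := dialNode(g, 1)
	fmt.Println(addr, err == nil)
}
```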

Contributor Author

Should we ditch all the extra machinery that does heartbeats, connection sharing, and everything else, and just stick to a pure admin connection to do the fan-out?

Contributor

We can pass in the RPC context too, if we need it. Seems reasonable to use the same machinery that we normally use.

Contributor

Alternatively, make it a method on loqrecovery.Server and pass in gossip and the RPC context during construction.

Contributor Author

Yeah, I prefer the last idea of passing at construction, as all the fields are static. But I think it makes more sense to make the function a closure so that it can always be replaced for tests. Otherwise it becomes more and more entangled with other components, which is tough to break up for tests.

Contributor

Could just use a testing knob as well, and dispatch to that at the start of the call when set.

@aliher1911 aliher1911 force-pushed the loq_05online_collect_replicas branch 2 times, most recently from 68cb761 to 38ff041 Compare January 16, 2023 13:26
This commit adds replica info and range descriptor collection from a
partially available cluster that has lost quorum on some ranges.
Collection is done using AdminServer calls. The cluster-wide call
performs a fan-out using node info from gossip. Local replica info
collection on nodes is done by scanning storages.

Release note (cli change): The file format used for transient loss of
quorum recovery files has changed. Replica info files generated by
earlier versions cannot be used with current and future versions.
@aliher1911 aliher1911 force-pushed the loq_05online_collect_replicas branch from 38ff041 to 76fc533 Compare January 16, 2023 16:22
@aliher1911
Contributor Author

bors r=erikgrinaker,knz

@craig
Contributor

craig bot commented Jan 17, 2023

Build failed (retrying...):

@craig
Contributor

craig bot commented Jan 17, 2023

Build succeeded:

@craig craig bot merged commit aaba7ad into cockroachdb:master Jan 17, 2023
@aliher1911 aliher1911 deleted the loq_05online_collect_replicas branch January 17, 2023 17:03
craig bot pushed a commit that referenced this pull request Jan 17, 2023
94239: loqrecovery: use captured meta range content for LOQ plans r=erikgrinaker a=aliher1911

Note: only last commit belongs to this PR. Will update description once #93157 lands.

Previously, the loss of quorum recovery planner used local replica info collected from all nodes to find the source of truth for replicas that lost quorum.
With the online approach, local info snapshots are not atomic. This could cause the planner to fail if available replicas are caught in different states on different nodes.
This commit adds an alternative planning approach for when range metadata is available. Instead of fixing individual replicas that can't make progress, it finds ranges that can't make progress from metadata using descriptors, and updates their replicas to recover from loss of quorum.
This commit also adds a replica collection stage as part of the make-plan command itself. To invoke collection from a cluster instead of using files, one needs to provide --host and other standard cluster connection flags (--cert-dir, --insecure, etc.) as appropriate.
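The metadata-driven quorum check at the heart of this approach can be sketched as follows (a simplified descriptor type; the real planner works on roachpb.RangeDescriptor):

```go
package main

import "fmt"

// rangeDesc is a simplified stand-in for a range descriptor read from the
// captured meta range.
type rangeDesc struct {
	RangeID int
	Voters  []int // node IDs of voting replicas
}

// rangesWithoutQuorum returns the IDs of ranges whose surviving voters no
// longer form a majority; those are the ranges the planner must repair.
func rangesWithoutQuorum(descs []rangeDesc, liveNodes map[int]bool) []int {
	var lost []int
	for _, d := range descs {
		live := 0
		for _, n := range d.Voters {
			if liveNodes[n] {
				live++
			}
		}
		if live*2 <= len(d.Voters) { // no strict majority survives
			lost = append(lost, d.RangeID)
		}
	}
	return lost
}

func main() {
	live := map[int]bool{1: true, 2: true, 3: true} // n4 and n5 are dead
	descs := []rangeDesc{
		{RangeID: 4, Voters: []int{2, 4, 5}},  // lost quorum
		{RangeID: 80, Voters: []int{1, 2, 3}}, // healthy
	}
	fmt.Println(rangesWithoutQuorum(descs, live))
}
```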

Example command output for a local cluster with 3 out of 5 nodes surviving looks like:
```
~/tmp ❯❯❯ cockroach debug recover make-plan --insecure --host 127.0.0.1:26257 >recover-plan.json
Nodes scanned:           3
Total replicas analyzed: 228
Ranges without quorum:   15
Discarded live replicas: 0

Proposed changes:
  range r4:/System/tsd updating replica (n2,s2):3 to (n2,s2):15. Discarding available replicas: [], discarding dead replicas: [(n5,s5):4,(n4,s4):2].
  range r80:/Table/106/1 updating replica (n1,s1):1 to (n1,s1):14. Discarding available replicas: [], discarding dead replicas: [(n5,s5):3,(n4,s4):2].
  range r87:/Table/106/1/"paris"/"\xcc\xcc\xcc\xcc\xcc\xcc@\x00\x80\x00\x00\x00\x00\x00\x00(" updating replica (n1,s1):1 to (n1,s1):14. Discarding available replicas: [], discarding dead replicas: [(n5,s5):3,(n4,s4):2].
  range r88:/Table/106/1/"seattle"/"ffffffH\x00\x80\x00\x00\x00\x00\x00\x00\x14" updating replica (n3,s3):3 to (n3,s3):15. Discarding available replicas: [], discarding dead replicas: [(n5,s5):4,(n4,s4):2].
  range r105:/Table/106/1/"washington dc"/"L\xcc\xcc\xcc\xcc\xccL\x00\x80\x00\x00\x00\x00\x00\x00\x0f" updating replica (n3,s3):3 to (n3,s3):14. Discarding available replicas: [], discarding dead replicas: [(n5,s5):1,(n4,s4):2].
  range r98:/Table/107/1/"boston"/"333333D\x00\x80\x00\x00\x00\x00\x00\x00\x03" updating replica (n2,s2):3 to (n2,s2):15. Discarding available replicas: [], discarding dead replicas: [(n5,s5):4,(n4,s4):2].
  range r95:/Table/107/1/"seattle"/"ffffffH\x00\x80\x00\x00\x00\x00\x00\x00\x06" updating replica (n3,s3):2 to (n3,s3):15. Discarding available replicas: [], discarding dead replicas: [(n4,s4):4,(n5,s5):3].
  range r125:/Table/107/1/"washington dc"/"DDDDDDD\x00\x80\x00\x00\x00\x00\x00\x00\x04" updating replica (n3,s3):2 to (n3,s3):14. Discarding available replicas: [], discarding dead replicas: [(n4,s4):1,(n5,s5):3].
  range r115:/Table/108/1/"boston"/"8Q\xeb\x85\x1e\xb8B\x00\x80\x00\x00\x00\x00\x00\x00n" updating replica (n2,s2):3 to (n2,s2):15. Discarding available replicas: [], discarding dead replicas: [(n5,s5):4,(n4,s4):2].
  range r104:/Table/108/1/"new york"/"\x1c(\xf5\u008f\\I\x00\x80\x00\x00\x00\x00\x00\x007" updating replica (n2,s2):2 to (n2,s2):15. Discarding available replicas: [], discarding dead replicas: [(n5,s5):4,(n4,s4):3].
  range r102:/Table/108/1/"seattle"/"p\xa3\xd7\n=pD\x00\x80\x00\x00\x00\x00\x00\x00\xdc" updating replica (n3,s3):2 to (n3,s3):15. Discarding available replicas: [], discarding dead replicas: [(n4,s4):4,(n5,s5):3].
  range r126:/Table/108/1/"washington dc"/"Tz\xe1G\xae\x14L\x00\x80\x00\x00\x00\x00\x00\x00\xa5" updating replica (n3,s3):2 to (n3,s3):14. Discarding available replicas: [], discarding dead replicas: [(n4,s4):1,(n5,s5):3].
  range r86:/Table/108/3 updating replica (n1,s1):1 to (n1,s1):14. Discarding available replicas: [], discarding dead replicas: [(n4,s4):3,(n5,s5):2].
  range r59:/Table/109/1 updating replica (n2,s2):3 to (n2,s2):15. Discarding available replicas: [], discarding dead replicas: [(n5,s5):4,(n4,s4):2].
  range r65:/Table/111/1 updating replica (n3,s3):3 to (n3,s3):15. Discarding available replicas: [], discarding dead replicas: [(n5,s5):4,(n4,s4):2].

Discovered dead nodes would be marked as decommissioned:
  n4, n5


Proceed with plan creation [y/N] y
Plan created.
To stage recovery application in half-online mode invoke:

'cockroach debug recover apply-plan  --host=127.0.0.1:26257 --insecure=true <plan file>'

Alternatively distribute plan to below nodes and invoke 'debug recover apply-plan --store=<store-dir> <plan file>' on:
- node n2, store(s) s2
- node n1, store(s) s1
- node n3, store(s) s3
```

Release note: None

Fixes: #93038
Fixes: #93046

94948: changefeedccl: give notice when metrics_label set without child_metrics r=samiskin a=samiskin

Resolves #75682

Surfaces a notice of
```
server.child_metrics.enabled is set to false, metrics will only be published to the '<scope>' label when it is set to true"
```
when the child_metrics setting isn't enabled during changefeed creation.

Release note (enterprise change): Changefeeds created/altered with a metrics_label set while server.child_metrics.enabled is false will now provide the user a notice upon creation.

95009: tree: fix panic when encoding tuple r=rafiss a=rafiss

fixes #95008

This adds a bounds check to avoid a panic.

Release note (bug fix): Fixed a crash that could happen when formatting a tuple with an unknown type.

95294: sql: make pg_description aware of builtin function descriptions r=rafiss,msirek a=knz

Epic: CRDB-23454
Fixes #95292.
Needed for #88061. 

First commit from #95289.

This also extends the completion rules to properly handle
functions in multiple namespaces.

Release note (bug fix): `pg_catalog.pg_description` and `pg_catalog.obj_description()` are now able to retrieve the descriptive help for built-in functions.

95356: server: remove unused migrationExecutor r=ajwerner a=ajwerner

This is no longer referenced since #91627.

Epic: none

Release note: None

Co-authored-by: Oleg Afanasyev <oleg@cockroachlabs.com>
Co-authored-by: Shiranka Miskin <shiranka.miskin@gmail.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Andrew Werner <awerner32@gmail.com>

Development

Successfully merging this pull request may close these issues.

loqrecovery: add replica info collector

4 participants