loqrecovery: add replica info collection to admin server#93157

Merged

craig[bot] merged 1 commit into cockroachdb:master from aliher1911:loq_05online_collect_replicas
Jan 17, 2023

Conversation

@aliher1911
Contributor

@aliher1911 aliher1911 commented Dec 6, 2022

This commit adds replica info and range descriptor collection from a partially available cluster that has lost quorum on some ranges. Collection is done using AdminServer calls. The cluster-wide call performs a fan-out using node info from gossip. Local replica info collection on nodes is done by scanning storages.

Release note: The file format used for transient loss of quorum recovery files has changed. Replica info files generated by earlier versions cannot be used with current and future versions.

Fixes #93040

Note that recovery doesn't need the collect command to be "remote capable", but allowing this makes development and debugging simpler, as you can create a local snapshot from a cluster and subsequently verify your planning against it.

Touches #74135
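As a rough illustration of the flow described above (this is a hedged sketch, not the actual CockroachDB API; `nodeID`, `replicaInfo`, and the helper names are all hypothetical): node IDs come from a gossip snapshot, each node scans its local storages, and the results are aggregated into a cluster-wide snapshot.

```go
package main

import "fmt"

// nodeID and replicaInfo are simplified stand-ins for the real types.
type nodeID int

type replicaInfo struct {
	Node    nodeID
	RangeID int
}

// collectLocalReplicaInfo stands in for the per-node storage scan that the
// AdminServer endpoint performs on each node.
func collectLocalReplicaInfo(id nodeID) []replicaInfo {
	// A real implementation would iterate over the node's stores.
	return []replicaInfo{{Node: id, RangeID: 1}}
}

// collectClusterReplicaInfo fans out to every node known to gossip and
// aggregates the per-node results into a cluster-wide snapshot.
func collectClusterReplicaInfo(gossipNodes []nodeID) []replicaInfo {
	var all []replicaInfo
	for _, id := range gossipNodes {
		all = append(all, collectLocalReplicaInfo(id)...)
	}
	return all
}

func main() {
	infos := collectClusterReplicaInfo([]nodeID{1, 2, 3})
	fmt.Printf("collected %d replica infos\n", len(infos))
}
```

In the real code the fan-out is an RPC to each node's admin server; the sketch only captures the aggregation shape.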

@cockroach-teamcity
Member

This change is Reviewable

@aliher1911 aliher1911 force-pushed the loq_05online_collect_replicas branch 2 times, most recently from bc3f976 to c321b6e Compare December 6, 2022 20:54
@aliher1911 aliher1911 force-pushed the loq_05online_collect_replicas branch 3 times, most recently from 4a20157 to de43c15 Compare December 20, 2022 14:10
@aliher1911 aliher1911 self-assigned this Dec 20, 2022
@aliher1911 aliher1911 force-pushed the loq_05online_collect_replicas branch 9 times, most recently from f1990db to a8c4c54 Compare December 23, 2022 15:26
@aliher1911 aliher1911 force-pushed the loq_05online_collect_replicas branch 5 times, most recently from 3eda780 to cbc19c6 Compare January 3, 2023 19:18
@aliher1911 aliher1911 force-pushed the loq_05online_collect_replicas branch 4 times, most recently from d1c0707 to 3284138 Compare January 5, 2023 16:00
@aliher1911 aliher1911 marked this pull request as ready for review January 5, 2023 17:31
@aliher1911 aliher1911 requested review from a team as code owners January 5, 2023 17:31
@aliher1911 aliher1911 requested a review from a team January 5, 2023 17:31
@aliher1911 aliher1911 requested review from benbardin and miretskiy and removed request for a team January 11, 2023 15:55
@aliher1911 aliher1911 force-pushed the loq_05online_collect_replicas branch from 0168d11 to c5781fd Compare January 11, 2023 15:59
@dhartunian dhartunian removed the request for review from a team January 11, 2023 17:28
@aliher1911 aliher1911 force-pushed the loq_05online_collect_replicas branch from c5781fd to 96ab655 Compare January 11, 2023 18:09
@aliher1911 aliher1911 force-pushed the loq_05online_collect_replicas branch 3 times, most recently from dae4d94 to 94bfa12 Compare January 12, 2023 11:39
// but in practice info collected from remote cluster will contain info per
// node. In case of offline collection, replicas will belong to all stores
// provided to the command regardless of owning node.
message NodeReplicaInfo {
Contributor

We should probably add a node_id field here now as well.

Contributor Author

That would not work very well with offline collection, where we are collecting all stores. There are no restrictions on it being a single node, as the comment says. We were advertising this as a way to mount stores from multiple containers into a single host and then running collection. I'm not comfortable changing old behaviours in small ways straight after introducing them.

Contributor

I see. I suppose we could still group by node ID in those cases (each store must necessarily belong to one node), or otherwise explicitly mark the data as belonging to multiple nodes (e.g. with node_id = 0). But I'm also fine with deferring that until we actually need it -- for now, I think the only practical consequence is potential confusion/bugs for developers touching this code.
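The grouping suggested here could look roughly like the following (hypothetical types for illustration only; `NodeID == 0` is the proposed sentinel for offline-collected data spanning multiple nodes):

```go
package main

import "fmt"

// replica is a simplified stand-in for collected replica info.
type replica struct {
	StoreID int
	NodeID  int // 0 marks offline-collected data spanning multiple nodes
}

// groupByNode buckets replicas by node ID; offline-collected replicas
// (NodeID == 0) all land in one shared bucket.
func groupByNode(replicas []replica) map[int][]replica {
	byNode := make(map[int][]replica)
	for _, r := range replicas {
		byNode[r.NodeID] = append(byNode[r.NodeID], r)
	}
	return byNode
}

func main() {
	groups := groupByNode([]replica{
		{StoreID: 1, NodeID: 1},
		{StoreID: 2, NodeID: 1},
		{StoreID: 3, NodeID: 0}, // offline collection, owning node unknown
	})
	fmt.Printf("%d buckets\n", len(groups))
}
```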

// version. It checks major and minor versions only. This is simple check for
// use with transient recovery info files only to prevent incorrect files being
// used in error.
func CanUseFileVersion(v roachpb.Version) error {
Contributor

We're checking the file version here, but only for the CLI command used to coordinate the change. Should it be possible to use an N-1 CLI version to recover an N version cluster or a mixed-version cluster? There's currently nothing preventing that, because RPC calls will allow mixing N and N-1 versions. If we allow that (which we currently do) then we have to provide N-1 backwards compatibility with all of the RPC APIs going forwards -- are we prepared to guarantee that? That also implies that we have to allow executing an N-1 plan on a node at version N. In either case, we must check the plan version during plan application.

Contributor Author

My idea is exactly that. For the planning phase, we don't want to mix versions of replica infos, and should only use ones that were created by the same version. When applying a plan, we should be able to use a plan from the previous version for the case of mixed-version clusters. So until you've finalized your upgrade, you should use the older version of the tools.

Contributor

If a node at version N has to remain compatible with a plan generated by a binary at version N-1, why can't the planner at version N remain compatible with data collected by a node at version N-1? Is that harder somehow? If we support running a binary at version N-1 against an unfinalized cluster at version N, then the cluster already has to remain compatible with both a collector and a planner at version N-1.

We discussed this a bit offline, and concluded that we want to support mixed-version clusters throughout the loss of quorum recovery process in the same way that we support mixed-version clusters elsewhere. However, this needs a bit of care with version checks and detection. We'll remove the mixed-version handling from this PR for now, and address mixed-version concerns throughout the loss of quorum process in a separate PR.
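The major/minor-only comparison that CanUseFileVersion performs might be sketched like this (a simplified stand-in version type; the real code operates on roachpb.Version):

```go
package main

import "fmt"

// version is a simplified stand-in for roachpb.Version.
type version struct {
	Major, Minor int32
}

// canUseFileVersion accepts a replica info file only when its major and
// minor versions match the binary's, mirroring the check discussed above.
// Patch-level differences are deliberately ignored.
func canUseFileVersion(binary, file version) error {
	if binary.Major != file.Major || binary.Minor != file.Minor {
		return fmt.Errorf(
			"replica info file version %d.%d is incompatible with binary version %d.%d",
			file.Major, file.Minor, binary.Major, binary.Minor)
	}
	return nil
}

func main() {
	err := canUseFileVersion(version{Major: 23, Minor: 1}, version{Major: 22, Minor: 2})
	fmt.Println(err != nil)
}
```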

@aliher1911 aliher1911 force-pushed the loq_05online_collect_replicas branch 2 times, most recently from f3c4b31 to 77fbf8f Compare January 13, 2023 11:48
Contributor

@knz knz left a comment

I reviewed just the server and cli changes.

Reviewed 2 of 22 files at r5, 7 of 18 files at r6.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @benbardin, @erikgrinaker, and @miretskiy)


pkg/cli/debug_recover_loss_of_quorum.go line 218 at r6 (raw file):

func runDebugDeadReplicaCollect(cmd *cobra.Command, args []string) error {
	// We need cancellable context here to obtain grpc client connection.
	// getAdminClient will refuse otherwise.

I think it's correct to need a cancellable context, but I don't understand this comment. What will getAdminClient refuse?


pkg/cli/debug_recover_loss_of_quorum.go line 236 at r6 (raw file):

		if err != nil {
			return errors.Wrap(err,
				"failed to retrieve replica info from cluster. "+

Remove the period at the end. As you've written it, the result would be: `failed to retrieve replica info from cluster.: rest of the error`

Try this instead:

errors.WithHint(errors.Wrap(err, "failed to retrieve replica info from cluster"),
   "Check the cluster's health and retry the operation.")

pkg/cli/debug_recover_loss_of_quorum.go line 453 at r6 (raw file):

			return loqrecoverypb.ClusterReplicaInfo{}, errors.Wrapf(err,
				"failed to unmarshal replica info from file %q, ensure that replica info file is "+
					"generated with the same binary version and file is not corrupted",

Use errors.WithHint.


pkg/cli/debug_recover_loss_of_quorum.go line 458 at r6 (raw file):

		if err = loqrecovery.CanUseFileVersion(nodeReplicas.CockroachVersion); err != nil {
			return loqrecoverypb.ClusterReplicaInfo{}, errors.Wrapf(err,
				"incorrect data version in file %s", filename)

%q for consistency.


pkg/server/admin.go line 3281 at r6 (raw file):

	visitor func(nodeID roachpb.NodeID, client serverpb.AdminClient) error,
) error {

nit: remove empty line.


pkg/server/admin.go line 3298 at r6 (raw file):

				continue
			}
			err = visitor(nodeID, client)

Maybe log the error if non-nil (even if retried). This will help us troubleshoot retry loops that don't make progress.

Contributor

@erikgrinaker erikgrinaker left a comment

Approving assuming we strip out the mixed-version stuff for now.

for r := retry.StartWithCtx(ctx, retryOpts); r.Next(); {
log.Infof(ctx, "visiting node n%d, attempt %d", nodeID, r.CurrentAttempt())
var client serverpb.AdminClient
client, err = s.dialNode(ctx, nodeID)
Contributor

This dependency on dialNode() is a bit iffy because it calls through to kvFanoutClient, which in general uses KV to retrieve node information. While dialNode() doesn't currently incur any KV calls, it very well could in the future.

I think we should be explicit about only depending on gossip here, since that's critical for this to continue functioning in broken clusters. We can pull VisitAvailableNodesWithRetry out into a standalone helper function under loqrecovery that takes a gossip client as a parameter, and reimplement dialNode() there (which is trivial). That way we're certain that this never relies on KV. It also removes the reverse dependency between systemAdminServer and loqrecovery.Server. Alternatively we could implement a gossipFanoutClient and pass that in, but that might be a bit overkill.
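The gossip-only helper proposed here could be shaped roughly as follows (a hypothetical interface; the real version would open a gRPC connection and return an AdminClient rather than the address):

```go
package main

import "fmt"

// gossipClient is a stand-in for the gossip-backed address lookup. By
// depending only on this interface, the fan-out can never fall back to KV.
type gossipClient interface {
	NodeAddress(nodeID int) (string, error)
}

// staticGossip is a trivial in-memory implementation for illustration.
type staticGossip map[int]string

func (g staticGossip) NodeAddress(nodeID int) (string, error) {
	addr, ok := g[nodeID]
	if !ok {
		return "", fmt.Errorf("n%d not found in gossip", nodeID)
	}
	return addr, nil
}

// dialNode resolves a node's address purely from gossip. A real version
// would dial the address and return a client connection.
func dialNode(g gossipClient, nodeID int) (string, error) {
	return g.NodeAddress(nodeID)
}

func main() {
	g := staticGossip{1: "10.0.0.1:26257", 2: "10.0.0.2:26257"}
	addr, err := dialNode(g, 1)
	fmt.Println(addr, err == nil)
}
```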

Contributor Author

Should we ditch all the extra machinery that does heartbeats, connection sharing, and everything else, and just stick to a pure admin connection to do the fan-out?

Contributor

We can pass in the RPC context too, if we need it. Seems reasonable to use the same machinery that we normally use.

Contributor

Alternatively, make it a method on loqrecovery.Server and pass in gossip and the RPC context during construction.

Contributor Author

Yeah, I prefer the last idea of passing at construction, as all the fields are static. But I think it makes more sense to make the function a closure so that it can always be replaced for tests. Otherwise it becomes more and more entangled with other components, which is tough to break up for tests.

Contributor

Could just use a testing knob as well, and dispatch to that at the start of the call when set.

@aliher1911 aliher1911 force-pushed the loq_05online_collect_replicas branch 2 times, most recently from 68cb761 to 38ff041 Compare January 16, 2023 13:26
This commit adds replica info and range descriptor collection from a
partially available cluster that has lost quorum on some ranges.
Collection is done using AdminServer calls. The cluster-wide call
performs a fan-out using node info from gossip. Local replica info
collection on nodes is done by scanning storages.

Release note (cli change): The file format used for transient loss of
quorum recovery files has changed. Replica info files generated by
earlier versions cannot be used with current and future versions.
@aliher1911 aliher1911 force-pushed the loq_05online_collect_replicas branch from 38ff041 to 76fc533 Compare January 16, 2023 16:22
@aliher1911
Contributor Author

bors r=erikgrinaker,knz

@craig
Contributor

craig bot commented Jan 17, 2023

Build failed (retrying...):

@craig
Contributor

craig bot commented Jan 17, 2023

Build succeeded:

@craig craig bot merged commit aaba7ad into cockroachdb:master Jan 17, 2023
@aliher1911 aliher1911 deleted the loq_05online_collect_replicas branch January 17, 2023 17:03
craig bot pushed a commit that referenced this pull request Jan 17, 2023
94239: loqrecovery: use captured meta range content for LOQ plans r=erikgrinaker a=aliher1911

Note: only last commit belongs to this PR. Will update description once #93157 lands.

Previously, the loss of quorum recovery planner used local replica info collected from all nodes to find the source of truth for replicas that lost quorum.
With the online approach, local info snapshots are not atomic. This could cause the planner to fail if available replicas are caught in different states on different nodes.
This commit adds an alternative planning approach for when range metadata is available. Instead of fixing individual replicas that can't make progress, it finds ranges that can't make progress from metadata using descriptors, and updates their replicas to recover from loss of quorum.
This commit also adds a replica collection stage as part of the make-plan command itself. To invoke collection from a cluster instead of using files, one needs to provide --host and other standard cluster connection flags (--cert-dir, --insecure, etc.) as appropriate.
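The metadata-driven quorum check at the heart of this approach can be sketched as follows (a simplified descriptor type; the real planner works on roachpb.RangeDescriptor):

```go
package main

import "fmt"

// rangeDesc is a simplified stand-in for a range descriptor read from the
// captured meta range.
type rangeDesc struct {
	RangeID int
	Voters  []int // node IDs of voting replicas
}

// rangesWithoutQuorum returns the IDs of ranges whose surviving voters no
// longer form a majority; those are the ranges the planner must repair.
func rangesWithoutQuorum(descs []rangeDesc, liveNodes map[int]bool) []int {
	var lost []int
	for _, d := range descs {
		live := 0
		for _, n := range d.Voters {
			if liveNodes[n] {
				live++
			}
		}
		if live*2 <= len(d.Voters) { // no strict majority survives
			lost = append(lost, d.RangeID)
		}
	}
	return lost
}

func main() {
	live := map[int]bool{1: true, 2: true, 3: true} // n4 and n5 are dead
	descs := []rangeDesc{
		{RangeID: 4, Voters: []int{2, 4, 5}},  // lost quorum
		{RangeID: 80, Voters: []int{1, 2, 3}}, // healthy
	}
	fmt.Println(rangesWithoutQuorum(descs, live))
}
```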

Example command output for a local cluster with 3 out of 5 nodes surviving looks like:
```
~/tmp ❯❯❯ cockroach debug recover make-plan --insecure --host 127.0.0.1:26257 >recover-plan.json
Nodes scanned:           3
Total replicas analyzed: 228
Ranges without quorum:   15
Discarded live replicas: 0

Proposed changes:
  range r4:/System/tsd updating replica (n2,s2):3 to (n2,s2):15. Discarding available replicas: [], discarding dead replicas: [(n5,s5):4,(n4,s4):2].
  range r80:/Table/106/1 updating replica (n1,s1):1 to (n1,s1):14. Discarding available replicas: [], discarding dead replicas: [(n5,s5):3,(n4,s4):2].
  range r87:/Table/106/1/"paris"/"\xcc\xcc\xcc\xcc\xcc\xcc@\x00\x80\x00\x00\x00\x00\x00\x00(" updating replica (n1,s1):1 to (n1,s1):14. Discarding available replicas: [], discarding dead replicas: [(n5,s5):3,(n4,s4):2].
  range r88:/Table/106/1/"seattle"/"ffffffH\x00\x80\x00\x00\x00\x00\x00\x00\x14" updating replica (n3,s3):3 to (n3,s3):15. Discarding available replicas: [], discarding dead replicas: [(n5,s5):4,(n4,s4):2].
  range r105:/Table/106/1/"washington dc"/"L\xcc\xcc\xcc\xcc\xccL\x00\x80\x00\x00\x00\x00\x00\x00\x0f" updating replica (n3,s3):3 to (n3,s3):14. Discarding available replicas: [], discarding dead replicas: [(n5,s5):1,(n4,s4):2].
  range r98:/Table/107/1/"boston"/"333333D\x00\x80\x00\x00\x00\x00\x00\x00\x03" updating replica (n2,s2):3 to (n2,s2):15. Discarding available replicas: [], discarding dead replicas: [(n5,s5):4,(n4,s4):2].
  range r95:/Table/107/1/"seattle"/"ffffffH\x00\x80\x00\x00\x00\x00\x00\x00\x06" updating replica (n3,s3):2 to (n3,s3):15. Discarding available replicas: [], discarding dead replicas: [(n4,s4):4,(n5,s5):3].
  range r125:/Table/107/1/"washington dc"/"DDDDDDD\x00\x80\x00\x00\x00\x00\x00\x00\x04" updating replica (n3,s3):2 to (n3,s3):14. Discarding available replicas: [], discarding dead replicas: [(n4,s4):1,(n5,s5):3].
  range r115:/Table/108/1/"boston"/"8Q\xeb\x85\x1e\xb8B\x00\x80\x00\x00\x00\x00\x00\x00n" updating replica (n2,s2):3 to (n2,s2):15. Discarding available replicas: [], discarding dead replicas: [(n5,s5):4,(n4,s4):2].
  range r104:/Table/108/1/"new york"/"\x1c(\xf5\u008f\\I\x00\x80\x00\x00\x00\x00\x00\x007" updating replica (n2,s2):2 to (n2,s2):15. Discarding available replicas: [], discarding dead replicas: [(n5,s5):4,(n4,s4):3].
  range r102:/Table/108/1/"seattle"/"p\xa3\xd7\n=pD\x00\x80\x00\x00\x00\x00\x00\x00\xdc" updating replica (n3,s3):2 to (n3,s3):15. Discarding available replicas: [], discarding dead replicas: [(n4,s4):4,(n5,s5):3].
  range r126:/Table/108/1/"washington dc"/"Tz\xe1G\xae\x14L\x00\x80\x00\x00\x00\x00\x00\x00\xa5" updating replica (n3,s3):2 to (n3,s3):14. Discarding available replicas: [], discarding dead replicas: [(n4,s4):1,(n5,s5):3].
  range r86:/Table/108/3 updating replica (n1,s1):1 to (n1,s1):14. Discarding available replicas: [], discarding dead replicas: [(n4,s4):3,(n5,s5):2].
  range r59:/Table/109/1 updating replica (n2,s2):3 to (n2,s2):15. Discarding available replicas: [], discarding dead replicas: [(n5,s5):4,(n4,s4):2].
  range r65:/Table/111/1 updating replica (n3,s3):3 to (n3,s3):15. Discarding available replicas: [], discarding dead replicas: [(n5,s5):4,(n4,s4):2].

Discovered dead nodes would be marked as decommissioned:
  n4, n5


Proceed with plan creation [y/N] y
Plan created.
To stage recovery application in half-online mode invoke:

'cockroach debug recover apply-plan  --host=127.0.0.1:26257 --insecure=true <plan file>'

Alternatively distribute plan to below nodes and invoke 'debug recover apply-plan --store=<store-dir> <plan file>' on:
- node n2, store(s) s2
- node n1, store(s) s1
- node n3, store(s) s3
```

Release note: None

Fixes: #93038
Fixes: #93046

94948: changefeedccl: give notice when metrics_label set without child_metrics r=samiskin a=samiskin

Resolves #75682

Surfaces a notice of
```
server.child_metrics.enabled is set to false, metrics will only be published to the '<scope>' label when it is set to true"
```
when the child_metrics setting isn't enabled during changefeed creation.

Release note (enterprise change): Changefeeds created/altered with a metrics_label set while server.child_metrics.enabled is false will now provide the user a notice upon creation.

95009: tree: fix panic when encoding tuple r=rafiss a=rafiss

fixes #95008

This adds a bounds check to avoid a panic.

Release note (bug fix): Fixed a crash that could happen when formatting a tuple with an unknown type.

95294: sql: make pg_description aware of builtin function descriptions r=rafiss,msirek a=knz

Epic: CRDB-23454
Fixes #95292.
Needed for #88061. 

First commit from #95289.

This also extends the completion rules to properly handle
functions in multiple namespaces.

Release note (bug fix): `pg_catalog.pg_description` and `pg_catalog.obj_description()` are now able to retrieve the descriptive help for built-in functions.

95356: server: remove unused migrationExecutor r=ajwerner a=ajwerner

This is no longer referenced since #91627.

Epic: none

Release note: None

Co-authored-by: Oleg Afanasyev <oleg@cockroachlabs.com>
Co-authored-by: Shiranka Miskin <shiranka.miskin@gmail.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Andrew Werner <awerner32@gmail.com>

Development

Successfully merging this pull request may close these issues.

loqrecovery: add replica info collector

4 participants