
cli: commands to perform unsafe recovery on quorum loss#71996

Merged
craig[bot] merged 1 commit into cockroachdb:master from aliher1911:collect_info_proto
Dec 21, 2021

Conversation

@aliher1911
Contributor

@aliher1911 aliher1911 commented Oct 26, 2021

Previously, if range quorum was lost because too many replicas became
permanently unavailable, cockroach could either be recovered from
a backup, or the debug-unsafe-remove-replicas command could be applied on
each node to recover to a potentially inconsistent state.
That command has some limitations in how it decides which replica is
the most up to date one and is not usually recommended.
This diff introduces a new process for recovery. It doesn't
guarantee consistency after recovery, but it provides additional checks
and lays the foundation for further recovery work.

Release note: None

Addresses: #71863, #71862, #71861

@cockroach-teamcity
Member

This change is Reviewable

@aliher1911 aliher1911 force-pushed the collect_info_proto branch 6 times, most recently from 6b43a6b to f2a7134 Compare October 29, 2021 13:21
@aliher1911 aliher1911 self-assigned this Oct 29, 2021
@aliher1911 aliher1911 changed the title <pkg>: <short description - lowercase, no final period> cli: add commands to unsafely resolve loss of quorum Nov 2, 2021
@aliher1911 aliher1911 changed the title cli: add commands to unsafely resolve loss of quorum cli: add debug cli commands to unsafely resolve loss of quorum Nov 2, 2021
@aliher1911 aliher1911 force-pushed the collect_info_proto branch 4 times, most recently from a544586 to dea5d1e Compare November 9, 2021 19:23
@aliher1911 aliher1911 changed the title cli: add debug cli commands to unsafely resolve loss of quorum cli: commands to perform unsafe recovery on quorum loss Nov 9, 2021
@aliher1911 aliher1911 force-pushed the collect_info_proto branch 2 times, most recently from 8fe055f to 31a8802 Compare November 12, 2021 23:55
@aliher1911 aliher1911 force-pushed the collect_info_proto branch 2 times, most recently from baca29a to 7216e8f Compare November 23, 2021 14:36
@aliher1911 aliher1911 marked this pull request as ready for review November 24, 2021 15:23
@aliher1911 aliher1911 requested review from a team as code owners November 24, 2021 15:23
@aliher1911
Contributor Author

I'm not keen on splitting this PR: if we cut it along the original issue lines, each part would provide no meaningful functionality by itself. Currently it covers the workflow end-to-end and only provides functionality that we already have with respect to how the replica is picked. It also provides a data-driven parser to make tests for newer use cases.
That said, I can try to cut it in smaller bits if needed before we are too deep into the review of actual code.

tbg
tbg previously requested changes Nov 25, 2021
Member

@tbg tbg left a comment


Phew, that was a big one. Looks good so far! A few high-level comments:

  • Not everything is implemented here. This is expected, but as a reviewer I'm left wondering which parts of the follow-up work you're aware of. For a PR this big I'd expect a much more comprehensive commit message that references issues (which also helps reference this PR from them in turn) and points out what's in scope for this PR and what isn't. For example, I'm left wondering if the "check that recovered cluster doesn't have holes in keyspace" work is tracked anywhere or if I need to point it out and check. The PR should let me know where we are leaving things. Since it doesn't, I was reviewing with uncertainty about what to point out and what to assume would be taken care of later.
  • The comments in general could be more detailed. A fair amount of them that exist are sort of saying the obvious thing without explaining the greater scheme of things (the comments were telling me what the next line does, rather than telling me why). I realize that this PR is still somewhat in progress and I support getting review as early as possible, so this is just a reminder to keep comments and documentation in mind as the foundations settle and the PR gets closer to a merge, at which point I'll also try to give more suggestions - it's just really tiring at this size of PR
  • A number of my comments suggest a stricter separation of cli-like functionality (interactivity, interacting with the flags, etc) and the meat. I do like the idea of pulling the business logic into pkg/kv/kvserver/loqrecovery so let's discuss (with the other reviewers) whether to do that. I'm strongly in favor but I've been overruled before :-), looking for @erikgrinaker's opinion in particular.
  • The datadriven stuff looks great, thanks for the effort that must have taken. I didn't review this in detail yet though, will wait for the other dust to settle.

each part would provide no meaningful functionality by itself.

I agree that once you have a 2700+ LOC PR it's not worth splitting it up. But it was also painful to review this giant chunk; I've been at it for a few hours and I'm still not done. There is generally an inverse relationship between the size of a commit and the quality of the review (which again correlates with code quality, documentation, etc). Perhaps the issues weren't the right way to split up the work, that is well possible, but in my opinion there is always some way to make it easier for the reviewer to provide a high-quality review by carving out chunks and thus "checkpointing" not only the work but also the scope and context. Concretely, here the cli scaffolding could've been set up in a separate commit. cli stuff is very different from the actual recovery logic and it has its own idiosyncrasies and reviewer mindset that I don't need to have for the rest of the review. Doing it that way might've presented an earlier opportunity for a reviewer like myself to suggest the kvrecovery package and the separation of concerns, which now comes with extra work for you, and extra review overhead as well. Doing this perfectly for one reviewer's taste is a bit of an art and for two mostly impossible, but I think here it would've been a good idea for all but the most exotic reviewers.

Reviewed 10 of 13 files at r1, 5 of 5 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @erikgrinaker)


pkg/cli/debug.go, line 1520 at r2 (raw file):

}

// debugSubCommandsForRocksDB lists debug subcommands that access rocksdb through the engine

s/rocksdb/pebble/i throughout.


pkg/cli/debug.go, line 1524 at r2 (raw file):

// This list is separate from debugCmdsForRocksDB as it is used to enable encryption options
// but not when building top level debug command list.
// Note: do NOT include commands that just call rocksdb code without setting up an engine.

After reading this comment, I still have no idea why there are two lists and when which one goes where. All of the commands in debugCmdsForPebble set up an engine, and yet they're all not in this list. Don't they also need encryption flags? See if you can clear this up.


pkg/cli/debug.go, line 1551 at r2 (raw file):

// DebugCommandsRequiringEncryption lists debug (sub)commands that access rocksdb through the engine
// and need encryption flags (injected by CCL code).
// Note: do NOT include commands that just call rocksdb code without setting up an engine.

ditto


pkg/cli/debug.go, line 1554 at r2 (raw file):

var DebugCommandsRequiringEncryption = append(debugCmdsForPebble, debugSubCommandsForPebble...)

// debugInteractive contains commands that need a confirmation flag

.

debugNeedsConfirmation instead of debugInteractive?


pkg/cli/debug.go, line 1665 at r2 (raw file):

	f = recoverDeadReplicaPlanCmd.Flags()
	f.StringVarP(&debugDeadReplicaPlanOpts.outputFileName, "plan", "o", "replica-remove-plan.json",

"output-file" instead of "plan" to keep things consistent?


pkg/cli/debug.go, line 1712 at r2 (raw file):

	for _, cmd := range debugInteractive {
		AddPersistentPreRunE(cmd, validateConfirmationFlag)

Unless I'm missing some technicality that makes this a non-starter, the idiomatic way of doing this is to add a new type that implements pflag.Value (the Set method does the validation), and to use f.Var instead of StringVar.
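[Editor's note] A minimal sketch of the pflag.Value pattern suggested here. The type and flag names are illustrative, not the ones in the PR, and a local interface stands in for pflag.Value (which has the same three-method contract: String, Set, Type), so the sketch has no external dependency:

```go
package main

import (
	"fmt"
	"strings"
)

// flagValue mirrors the pflag.Value interface, so confirmActionFlag below
// could be registered with f.Var(&action, "confirm", "...") instead of
// f.StringVar plus a separate PersistentPreRunE validation hook.
type flagValue interface {
	String() string
	Set(string) error
	Type() string
}

// confirmActionFlag validates the --confirm value at parse time.
// Assumed valid values: y (yes to all), n (no to all), p (prompt).
type confirmActionFlag string

func (c *confirmActionFlag) String() string { return string(*c) }

// Set is where the validation lives; cobra/pflag calls it while parsing,
// so an invalid value fails before the command's RunE ever starts.
func (c *confirmActionFlag) Set(v string) error {
	switch strings.ToLower(v) {
	case "y", "n", "p":
		*c = confirmActionFlag(strings.ToLower(v))
		return nil
	default:
		return fmt.Errorf("invalid --confirm value %q; must be y, n, or p", v)
	}
}

func (c *confirmActionFlag) Type() string { return "confirmAction" }

func main() {
	var action confirmActionFlag = "p" // default: prompt interactively
	var _ flagValue = &action          // compile-time interface check

	if err := action.Set("y"); err != nil {
		panic(err)
	}
	fmt.Println(action) // y
	if err := action.Set("maybe"); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

The advantage over a PreRunE hook is that the error surfaces during flag parsing with standard cobra error formatting, and the validation travels with the type wherever the flag is reused.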


pkg/cli/debug.go, line 1714 at r2 (raw file):

		AddPersistentPreRunE(cmd, validateConfirmationFlag)
		f = cmd.Flags()
		f.StringVar(&interactivePromptOpts.action, "confirm", "p",

snazzy!


pkg/cli/debug_recover_loss_of_quorum.go, line 104 at r2 (raw file):

0. Stop the cluster

1. Run cockroach debug recover collect-info on every node to collect

Could you make sure to point out that the commands need to be invoked per store, not once per node. (Or once per node, supplying all stores - both should work.)


pkg/cli/debug_recover_loss_of_quorum.go, line 128 at r2 (raw file):

func init() {
	debugRecoverCmd.AddCommand(
		recoverDeadReplicaCollectCmd,

Let's name these variables like the commands, so debugRecoverCollectInfoCmd, etc, so that nobody has to ever remember the custom name for these.


pkg/cli/debug_recover_loss_of_quorum.go, line 143 at r2 (raw file):

See debug recover help for more details on how to use this command.
`,
	Args: cobra.MaximumNArgs(1),

Without args it goes to stdout? Or do you want ExactArgs(1). Explain it in Long above if it is in fact optional (but probably not a good idea, since stdout receives other writes too due to prompt?)


pkg/cli/debug_recover_loss_of_quorum.go, line 147 at r2 (raw file):

}

var debugDeadReplicaCollectOpts struct {

ditto about consistent naming


pkg/cli/debug_recover_loss_of_quorum.go, line 152 at r2 (raw file):

func runDebugDeadReplicaCollect(_ *cobra.Command, args []string) error {
	ctx := context.Background()

cmd.Context() is available from the *Command.


pkg/cli/debug_recover_loss_of_quorum.go, line 164 at r2 (raw file):

		}
		stores = append(stores, db)
		defer db.Close()

You don't need this since you're passing a stopper. (It's a little silly that the stopper has to be supplied, but now's not the time to clean this up, or maybe it is, depending on your mood).


pkg/cli/debug_recover_loss_of_quorum.go, line 173 at r2 (raw file):

	dest := ""
	promptOut := os.Stderr

Ah, so there is some special casing here. Make sure to document it please.


pkg/cli/debug_recover_loss_of_quorum.go, line 176 at r2 (raw file):

	if len(args) > 0 {
		dest = args[0]
		promptOut = os.Stdout

Could it be confusing to switch between Stdout and Stderr for interactiveness? I think we should consider keeping it simple and say that unless you're fully onboard with y everywhere, you have to provide a file. Otherwise, you can use stdout and there won't be anything there except the file. Or we say the prompt is always on stderr, I assume that would also just work.


pkg/cli/debug_recover_loss_of_quorum.go, line 178 at r2 (raw file):

		promptOut = os.Stdout
	}
	return marshalAndWriteResults(descriptors, dest, getConfirmAction(promptOut))

Does this command really need confirmation?


pkg/cli/debug_recover_loss_of_quorum.go, line 206 at r2 (raw file):

				RaftAppliedIndex:          rstate.RaftAppliedIndex,
				RaftCommittedIndex:        hstate.Commit,
				HasUncommittedDescriptors: false,

Looks like this is TODO, please call that out in a comment. Ideally there's a tracking issue that you can link.


pkg/cli/debug_recover_loss_of_quorum.go, line 218 at r2 (raw file):

}

var recoverDeadReplicaPlanCmd = &cobra.Command{

ditto name


pkg/cli/debug_recover_loss_of_quorum.go, line 226 at r2 (raw file):

This command will identify ranges where raft consensus could not progress and pick
survivor replicas to act as a source of truth.

Explain that this step also runs various sanity checks that try to prevent the operator from embarking on a futile recovery attempt.


pkg/cli/debug_recover_loss_of_quorum.go, line 236 at r2 (raw file):

}

var debugDeadReplicaPlanOpts struct {

ditto name


pkg/cli/debug_recover_loss_of_quorum.go, line 250 at r2 (raw file):

		}
		data, err := io.ReadAll(f)
		_ = f.Close()

up to here this can all be abbreviated to ioutil.ReadFile(filename)


pkg/cli/debug_recover_loss_of_quorum.go, line 256 at r2 (raw file):

		var nodeDescriptors []quorum.LocalRangeInfo
		if err = json.Unmarshal(data, &nodeDescriptors); err != nil {

We use jsonpb elsewhere, see for example here, we should probably use it here too.


pkg/cli/debug_recover_loss_of_quorum.go, line 267 at r2 (raw file):

		promptOut = os.Stderr
	}
	confirm := getConfirmAction(promptOut)

Please refactor all of these methods such that they don't rely on the global state. That is, make a little trampoline where the actual command sets up the parameters and then invokes a method that has all of the business logic (passing all of the configuration directly to it). For inspiration, see

if err != nil {
which is the business logic. The command only has the setup and plumbing as well as the confirmation prompt, but nothing that you may want to unit test. I would like the datadriven tests to not touch any of the config state in cli, i.e. operate purely on the business logic. (Haven't gotten to that part of the review yet).


pkg/cli/debug_recover_loss_of_quorum.go, line 273 at r2 (raw file):

	}

	// Run planner

.


pkg/cli/debug_recover_loss_of_quorum.go, line 285 at r2 (raw file):

	descriptors []quorum.LocalRangeInfo, deadStores []roachpb.StoreID, d dialog,
) ([]quorum.ReplicaUpdate, error) {
	liveStoreIDs, missingStoreIDs, err := validateReplicaSets(descriptors, deadStores)

maybe present instead of live? The word live already has too many meanings in CRDB. This would carry over to the docs on validateReplicaSets.


pkg/cli/debug_recover_loss_of_quorum.go, line 291 at r2 (raw file):

	d.printf("Found live stores: %v\n", joinStoreIDs(liveStoreIDs))
	if len(deadStores) == 0 {
		if ok, err := d.confirm(fmt.Sprintf("Found dead stores %s. Proceed with deletion", joinStoreIDs(missingStoreIDs))); err != nil {

I thought we would print this at the end of plan creation (i.e. in the main method), with a prompt to confirm that these are actually exactly the stores the operator expects to be dead, and to confirm that (depending on the settings). Then planReplicas would become non-interactive and instead of a dialog it could take a logger interface { Infof(string, ...interface{}) } that defaults to if log.V(1) { log.Infof(ctx, format, args...) } in prod and is hooked up to datadriven (maybe optionally) during datadriven testing.
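[Editor's note] A small sketch of the narrow logger interface proposed above, with a production default and a capturing implementation for datadriven tests. All names are illustrative, not the PR's:

```go
package main

import (
	"fmt"
	"strings"
)

// planLogger is the minimal interface the planning logic would accept
// instead of an interactive dialog, keeping planReplicas non-interactive.
type planLogger interface {
	Infof(format string, args ...interface{})
}

// stderrLogger could be the production default (the real code would route
// through pkg/util/log behind a verbosity check rather than fmt).
type stderrLogger struct{}

func (stderrLogger) Infof(format string, args ...interface{}) {
	fmt.Printf(format+"\n", args...)
}

// captureLogger records messages so a datadriven test can assert on them.
type captureLogger struct{ lines []string }

func (c *captureLogger) Infof(format string, args ...interface{}) {
	c.lines = append(c.lines, fmt.Sprintf(format, args...))
}

// planReplicas stands in for the planning step: it only reports progress
// through the logger and never prompts, so confirmation stays in the cli.
func planReplicas(deadStoreIDs []int, l planLogger) {
	l.Infof("planning around %d dead store(s)", len(deadStoreIDs))
}

func main() {
	c := &captureLogger{}
	planReplicas([]int{2, 3}, c)
	fmt.Println(strings.Join(c.lines, ";")) // planning around 2 dead store(s)
}
```

With this shape, the cli command prints the dead-store summary and asks for confirmation itself, then calls planReplicas with whichever logger fits the context.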


pkg/cli/debug_recover_loss_of_quorum.go, line 301 at r2 (raw file):

	}

	plan := []quorum.ReplicaUpdate{}

nit: var plan []quorum.ReplicaUpdate is ever so slightly more idiomatic (imo) since the canonical zero-element slice is the nil value, not the empty slice.


pkg/cli/debug_recover_loss_of_quorum.go, line 302 at r2 (raw file):

	plan := []quorum.ReplicaUpdate{}
	// Find ranges that lost quorum and create updates for them

.


pkg/cli/debug_recover_loss_of_quorum.go, line 376 at r2 (raw file):

				NodeID:    rangeDesc.NodeID,
				StoreID:   rangeDesc.StoreID,
				ReplicaID: desc.NextReplicaID,

If there is a survivor whose ReplicaID is >= desc.NextReplicaID we need to error out, do we do that now? In fact better than erroring out would be picking that other replica as a survivor, but let's make sure we catch & test this problem first and we can fix it later, it's unlikely to make much of a difference in practice.
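[Editor's note] A sketch of the safety check being requested: the resuscitated replica's new ID must be strictly greater than every surviving replica's ID, otherwise the descriptor is stale and the plan should be rejected. Function and parameter names are hypothetical:

```go
package main

import "fmt"

// chooseReplicaID returns the ID to assign to the designated survivor.
// If any surviving replica already holds an ID >= NextReplicaID, the
// descriptor we read is stale and proceeding would create a duplicate
// or out-of-order replica ID, so we error out instead.
func chooseReplicaID(nextReplicaID int, survivorReplicaIDs []int) (int, error) {
	for _, s := range survivorReplicaIDs {
		if s >= nextReplicaID {
			return 0, fmt.Errorf(
				"survivor replica ID %d >= NextReplicaID %d; descriptor is stale",
				s, nextReplicaID)
		}
	}
	return nextReplicaID, nil
}

func main() {
	id, err := chooseReplicaID(5, []int{1, 3})
	fmt.Println(id, err) // 5 <nil>

	_, err = chooseReplicaID(5, []int{1, 6})
	fmt.Println(err != nil) // true
}
```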


pkg/cli/debug_recover_loss_of_quorum.go, line 388 at r2 (raw file):

, i.e. (barring operator error) the conclusive list of all remaining stores in the cluster.


pkg/cli/debug_recover_loss_of_quorum.go, line 395 at r2 (raw file):

func validateReplicaSets(
	descriptors []quorum.LocalRangeInfo, deadStores []roachpb.StoreID,
) (liveStoreIDs, missingStoreIDs storeIDSet, err error) {

_ error


pkg/cli/debug_recover_loss_of_quorum.go, line 396 at r2 (raw file):

	descriptors []quorum.LocalRangeInfo, deadStores []roachpb.StoreID,
) (liveStoreIDs, missingStoreIDs storeIDSet, err error) {
	// Find on live stores from all descriptors.

-on (but is this comment even worth it?)


pkg/cli/debug_recover_loss_of_quorum.go, line 398 at r2 (raw file):

Populate missingStoreIDs with all of the StoreIDs referenced in replica descriptors, and populate presentStoreIDs with all of the stores from which we have received plans.


pkg/cli/debug_recover_loss_of_quorum.go, line 406 at r2 (raw file):

		}
	}
	// Check that requested dead stores don't overlap with live.

// The difference between all referenced StoreIDs (missingStoreIDs) and the present StoreIDs (presentStoreIDs) should exactly equal the user-provided list of dead stores (deadStores), and the former must be a superset of the latter (since each descriptor found on a store references that store). Verify all of these conditions and error out if one of them does not hold.

We need to have unit tests for all of these cases (haven't fully reviewed yet, maybe they're already there).
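[Editor's note] The conditions described above can be sketched as plain set arithmetic: referenced stores minus present stores must exactly equal the operator-supplied dead stores, and every present store must be referenced. A hedged sketch with illustrative names (the real code operates on roachpb.StoreID and descriptor structs):

```go
package main

import (
	"fmt"
	"sort"
)

// validateStoreSets checks the invariants from the review comment:
//   - every store that provided info must be referenced by some descriptor,
//   - referenced - present must be exactly the user-provided dead set,
//   - no store marked dead may have provided info.
func validateStoreSets(referenced, present, dead map[int]bool) error {
	for s := range present {
		if !referenced[s] {
			return fmt.Errorf("store s%d provided info but is not referenced by any descriptor", s)
		}
	}
	var missing []int
	for s := range referenced {
		if !present[s] {
			missing = append(missing, s)
		}
	}
	sort.Ints(missing) // deterministic error ordering
	for _, s := range missing {
		if !dead[s] {
			return fmt.Errorf("store s%d is missing but was not marked dead", s)
		}
	}
	for s := range dead {
		if present[s] {
			return fmt.Errorf("store s%d was marked dead but provided info", s)
		}
	}
	return nil
}

func main() {
	referenced := map[int]bool{1: true, 2: true, 3: true}
	present := map[int]bool{1: true, 2: true}
	fmt.Println(validateStoreSets(referenced, present, map[int]bool{3: true})) // <nil>
	fmt.Println(validateStoreSets(referenced, present, map[int]bool{}) != nil) // true
}
```

Each of the three error branches corresponds to a unit-test case the comment asks for.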


pkg/cli/debug_recover_loss_of_quorum.go, line 429 at r2 (raw file):

	}
	if len(deadStores) > 0 && len(missingButNotDeadStoreIDs) > 0 {
		// We can't proceed with this if dead nodes were explicitly provided

.


pkg/cli/debug_recover_loss_of_quorum.go, line 446 at r2 (raw file):

	if filename != "" {
		// If file exists confirm overwriting it.

It's almost not worth quibbling about this, but as I already brought up elsewhere I'm not sure if it's worth asking for prompts on commands that simply write the desired output file. I'd trust @knz's judgment call on that but I don't think this is what we consistently do elsewhere in the cli. I do think it makes sense to avoid overwriting files but at least the ballast command simply errors out:

tobias@td:~/go/src/github.com/cockroachdb/cockroach$ ./cockroach debug ballast -z 1MiB foo.asd
tobias@td:~/go/src/github.com/cockroachdb/cockroach$ ./cockroach debug ballast -z 1MiB foo.asd
ERROR: file already exists
Failed running "debug ballast"

I don't feel strongly about this here. Just curious what the "right" thing to do is. I do like to keep it simple, and there's enough code here around this handling to have a bug.


pkg/cli/debug_recover_loss_of_quorum.go, line 468 at r2 (raw file):

}

var recoverDeadReplicaExecuteCmd = &cobra.Command{

ditto name


pkg/cli/debug_recover_loss_of_quorum.go, line 475 at r2 (raw file):

In ranges with a lost quorum, designated replicas would serve as a source of
truth and allow raft to progress so that data becomes accessible.

We'll want more docs on all of these commands but let's put that on the back burner for this PR.


pkg/cli/debug_recover_loss_of_quorum.go, line 483 at r2 (raw file):

}

var debugDeadReplicaRemoveOpts struct {

ditto name
Also the current name is a misnomer, sure it does remove replicas, but more importantly, it resuscitates the designated survivors with a config in which they can make progress again.


pkg/cli/debug_recover_loss_of_quorum.go, line 488 at r2 (raw file):

func runDebugDeadReplicaRemove(_ *cobra.Command, args []string) error {
	ctx := context.Background()

ditto ctx


pkg/cli/debug_recover_loss_of_quorum.go, line 491 at r2 (raw file):

	stopper := stop.NewStopper()
	defer stopper.Stop(context.Background())

ctx


pkg/cli/debug_recover_loss_of_quorum.go, line 499 at r2 (raw file):

	}

	data, err := io.ReadAll(f)

ioutil.ReadAll


pkg/cli/debug_recover_loss_of_quorum.go, line 502 at r2 (raw file):

	_ = f.Close()
	if err != nil {
		return errors.Wrapf(err, "failed to read removal plan file %s", planFile)

ditto "removal" naming, "read plan file" is probably enough here anyway


pkg/cli/debug_recover_loss_of_quorum.go, line 506 at r2 (raw file):

	var nodeUpdates []quorum.ReplicaUpdate
	if err = json.Unmarshal(data, &nodeUpdates); err != nil {

ditto about considering jsonpb.


pkg/cli/debug_recover_loss_of_quorum.go, line 510 at r2 (raw file):

	}

	// Open all stores

.


pkg/cli/debug_recover_loss_of_quorum.go, line 511 at r2 (raw file):

	// Open all stores
	localNodeID := roachpb.NodeID(0)

nit: var localNodeID roachpb.NodeID


pkg/cli/debug_recover_loss_of_quorum.go, line 528 at r2 (raw file):

				return errors.Errorf("Found stores from multiple node IDs n%d, n%d. Can only run in context of single node.", localNodeID, storeIdent.NodeID)
			}
			localNodeID = storeIdent.NodeID

I am confused why the concept of a NodeID plays a role here. Everything operates on StoreIDs - the designated survivors plan is logically a map from RangeID to StoreID (StoreID is unique across cluster). Try to keep NodeID out of it and if we absolutely need it, make clear why. (It's fine to use it for validation, but it looks like we're using it for more than that).


pkg/cli/debug_recover_loss_of_quorum.go, line 532 at r2 (raw file):

		stores[storeIdent.StoreID] = db
	}

Can stores be empty here? That would be user error. I know the command specifies ExactArgs but as far as this method is concerned it would constitute a bug to hand a plan that results in no writes.


pkg/cli/debug_recover_loss_of_quorum.go, line 533 at r2 (raw file):

	}

	// Perform deletion

.


pkg/cli/debug_recover_loss_of_quorum.go, line 534 at r2 (raw file):

	// Perform deletion
	err = removeReplicas(ctx, nodeUpdates, localNodeID, stores, getConfirmAction(os.Stdout))

Here too I would suggest the pattern in which removeReplicas is pure business logic that writes to a batch per store (instead of stores map) and the confirmation lives in this method. Plumbing interactivity down into business logic makes things more complex and when we can avoid it we should.


pkg/cli/debug_recover_loss_of_quorum.go, line 545 at r2 (raw file):

	d dialog,
) error {
	// Maybe track attempts to apply changes to filter cases where we run the same plan again?

This looks like a TODO, make it a proper one. As is, this is potentially causing confusion about what the code is or is not doing.

Also, add a TODO (maybe replacing this one?) about writing to logs the fact that the recovery happened. We need to discuss what the right logs are. Ideally you want audit-type logging, but this is the cli so we don't have that luxury. We could stash a file in the store directory or write a store-local key that then translates, on node startup, into a proper log structured log entry (without removing the store key - can't hurt to remember that fact forever). I think that is the right way to do it. Definitely not for this PR, but make sure to file an issue. (cc @erikgrinaker)


pkg/cli/debug_recover_loss_of_quorum.go, line 567 at r2 (raw file):

	if len(missing) > 0 {
		if ok, err := d.confirm(fmt.Sprintf(
			"Stores %s expected on the node but no paths were provided. Proceed with remaining stores",

I've been thinking that it might be the better approach to always expect a valid plan. If the plan is invalid and the operator wants to go ahead anyway, they need to edit the plan to make it valid. Haven't fully thought this through but I like how this makes clear which usage of this tool we support and which one we don't. We support valid, CRDB-generated plans only and everything else is an exceptional situation.


pkg/cli/debug_recover_loss_of_quorum.go, line 577 at r2 (raw file):

	for _, update := range plan {
		if nodeID != update.NodeID {

this would just be batch, ok := stores[update.StoreID]; if !ok { continue }, no need for a NodeID here?


pkg/cli/debug_recover_loss_of_quorum.go, line 587 at r2 (raw file):

		// store check.
		batch := batches[update.StoreID]
		if err := prepareRemoveReplicaFromStore(ctx, store, batch, update); err != nil {

applyRecoveryUpdate(...) could be a more apt name.


pkg/cli/debug_recover_loss_of_quorum.go, line 640 at r2 (raw file):

	// to make progress. It will elect itself leader and upreplicate.
	//
	// The old replica on s1 is untouched by this process. It will

We talked about a need to also replicaGC all the "non-designated survivors" (I hope this made it into the RFC, not sure) to ensure smooth operations (if they're still around, they are unavailable, and can hold up requests that accidentally get routed to them). Please make sure this is tracked since this PR isn't implementing it yet (and shouldn't try to). cc @erikgrinaker


pkg/cli/debug_recover_loss_of_quorum.go, line 668 at r2 (raw file):

		return err
	}
	// Sanity check that this is indeed the right range

.


pkg/cli/debug_recover_loss_of_quorum.go, line 672 at r2 (raw file):

		return errors.Errorf("unexpected range ID at key: expected r%d but found r%d", update.RangeID, desc.RangeID)
	}
	// Check if replica is in a fixed state already if we already applied the change.

Does this have any practical benefit? Isn't everything here idempotent?

Ideally we would also "CPut" the update here. For example, if someone stopped the stores, made some plans, restarted some stores and some of the stuck ranges actually managed to make progress, and then later we apply the plan, that would bounce (at least where the discrepancy could be detected). Not for now, but worth tracking or at least a TODO.


pkg/cli/debug_recover_loss_of_quorum_test.go, line 50 at r2 (raw file):

*/

// Range info used for test data to avoid providing unnecessary fields that are not used in

you could dump all of these wrappers in a ....._util_test.go file to keep the interesting bits of this test more front and center.


pkg/cli/debug_recover_loss_of_quorum_test.go, line 53 at r2 (raw file):

// replica removal.
type testReplicaInfo struct {
	// Replica location

.


pkg/cli/debug_recover_loss_of_quorum_test.go, line 57 at r2 (raw file):

	StoreID roachpb.StoreID `yaml:"StoreID"`

	// Descriptor as seen by replica

.


pkg/cli/debug_recover_loss_of_quorum_test.go, line 61 at r2 (raw file):

	StartKey   string                  `yaml:"StartKey"`
	EndKey     string                  `yaml:"EndKey"`
	Replicas   []replicaDescriptorView `yaml:"Replicas,flow"`

Looks like you've turned into a "making yaml look nice" deity 👍, I'll remember these attributes.


pkg/cli/debug_recover_loss_of_quorum_test.go, line 64 at r2 (raw file):

	Generation roachpb.RangeGeneration `yaml:"Generation,omitempty"`

	// Raft state

.


pkg/cli/debug_recover_loss_of_quorum_test.go, line 69 at r2 (raw file):

	HasUncommittedDescriptors bool   `yaml:"HasUncommittedDescriptors"`

	// Intent presence?

TODO


pkg/cli/debug_recover_loss_of_quorum_test.go, line 90 at r2 (raw file):

is a wrapper type


pkg/cli/debug_recover_loss_of_quorum_test.go, line 106 at r2 (raw file):

}

// Store with its owning NodeID for easier grouping by owning nodes

.


pkg/cli/debug_recover_loss_of_quorum_test.go, line 123 at r2 (raw file):

}

func (e *quorumRecoveryEnv) Handle(t *testing.T, d datadriven.TestData) string {

might be worth a debug_recover_env_test.go file for the env.


pkg/cli/debug_recover_loss_of_quorum_test.go, line 128 at r2 (raw file):

	switch d.Cmd {
	case "replication-data":
		// Populate in-mem engines with data

.


pkg/cli/debug_recover_loss_of_quorum_test.go, line 131 at r2 (raw file):

		out = e.handleReplicationData(t, d)
	case "collect-replica-info":
		// Collect one or more range info "files" from stores

.


pkg/cli/debug_recover_loss_of_quorum_test.go, line 134 at r2 (raw file):

		out, err = e.handleCollectReplicas(t, d)
	case "make-plan":
		// Make plan out of multiple collected replica info

.


pkg/cli/debug_recover_loss_of_quorum_test.go, line 137 at r2 (raw file):

		out, err = e.handleMakePlan(t, d)
	case "dump-store":
		// Dump the content of the store (all descriptors) for verification

.


pkg/cli/debug_recover_loss_of_quorum_test.go, line 140 at r2 (raw file):

		out = e.handleDumpStore(t, d)
	case "apply-plan":
		// Create a plan from listed replicas

.


pkg/cli/debug_recover_loss_of_quorum_test.go, line 158 at r2 (raw file):

	clock := hlc.NewClock(hlc.UnixNano, time.Millisecond*100)

	// Drop existing data

.


pkg/cli/debug_recover_loss_of_quorum_test.go, line 161 at r2 (raw file):

	e.stores = make(map[roachpb.StoreID]wrappedStore)

	// load yaml from data into local range info

Capitalize "load", and add a period.


pkg/cli/cliflags/flags.go, line 1667 at r2 (raw file):

Flag is syntactically identical to --store flag of start command, but only path
part is used. This is done to make flags interoperable between start and recover
commands.

Good idea.


pkg/cli/quorum/quorum.proto, line 13 at r2 (raw file):

syntax = "proto3";
package cockroach.quorum;
// TODO(oleg): sort out package and directory naming here

Reviewing all of this code I was actually thinking that we shouldn't dump most of it into cli but keep the business logic and testing in pkg/kv/kvserver/loqrecovery (mod obligatory name 🚲 🏚️). Then quorum would become pkg/kv/kvserver/loqrecoverypb. The cli package would only do the flags plumbing and engine, etc, setup and then call out to loqrecovery.


pkg/cli/testdata/quorum/learners_lose, line 2 at r2 (raw file):

# Tests verifying that learners don't become survivors

Please give these tests a lot more prose. YAML supports inline comments (#) so you can do stuff like

- { NodeID: 2, StoreID: 2, ReplicaID: 2, ReplicaType: 1} # learner is highest surviving storeID, but must not be chosen

pkg/cli/testdata/quorum/learners_lose, line 4 at r2 (raw file):


replication-data
- NodeID: 1

Doesn't seem like you really need NodeID in these tests at all. I might be missing some reason that makes it convenient. Certainly if the logic uses it it needs to be here as well but as mentioned in earlier comments, I don't think the NodeID should play any role in the business logic.


pkg/cli/testdata/quorum/learners_lose, line 42 at r2 (raw file):

  StoreID: 1
  RangeID: 1
  StartKey: []

inconsistent StartKey, this printed /Min in replication-data


pkg/cli/testdata/quorum/learners_lose, line 44 at r2 (raw file):

  StartKey: []
  NewReplica:
    NodeID: 1

Seems unnecessary to duplicate these, the new replica's Store/Node is always same.

Contributor

@erikgrinaker erikgrinaker left a comment


Initial pass to reply to comments, will review tomorrow.

I'd expect a much more comprehensive commit message

+1

I do like the idea of pulling the business logic into pkg/kv/kvserver/loqrecovery so let's discuss (with the other reviewers) whether to do that.

I agree. Especially since we want to do some of this stuff online down the road, so the server process would presumably need to import this logic too. But also just because non-trivial logic should be separated out from the user interface anyway.

Reviewed 2 of 13 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @knz, and @tbg)


pkg/cli/debug.go, line 1665 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

"output-file" instead of "plan" to keep things consistent?

I'd prefer "plan" over "output-file" in general, since it's more specific.


pkg/cli/debug_recover_loss_of_quorum.go, line 176 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Could it be confusing to switch between Stdout and Stderr for interactiveness? I think we should consider keeping it simple and say that unless you're fully onboard with y everywhere, you have to provide a file. Otherwise, you can use stdout and there won't be anything there except the file. Or we say the prompt is always on stderr, I assume that would also just work.

Yeah, I'd keep it simple and just stick with stderr for the prompts.


pkg/cli/debug_recover_loss_of_quorum.go, line 256 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

We use jsonpb elsewhere, see for example here, we should probably use it here too.

Yes, always use jsonpb. The Protobuf spec mandates a canonical JSON encoding for Protobuf messages, which is incompatible with the Go stdlib's JSON encoding convention. Not only does this break interoperability, but it can also cause data loss for certain datatypes and structures when passing a Protobuf through a stdlib Marshal/Unmarshal cycle.
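The data-loss point can be demonstrated with nothing but the Go stdlib. The sketch below is illustrative, not CockroachDB code: a large int64 (think of a raft applied index) silently loses precision in a stdlib Marshal/Unmarshal cycle because untyped JSON decoding lands on float64 — exactly the hazard the canonical proto3 JSON mapping avoids by encoding 64-bit integers as strings.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// roundTrip encodes an int64 field with the stdlib and decodes it back
// through an untyped map, the way generic JSON tooling would. Untyped
// decoding represents every JSON number as a float64, which carries only
// 53 mantissa bits.
func roundTrip(v int64) int64 {
	b, err := json.Marshal(map[string]int64{"index": v})
	if err != nil {
		panic(err)
	}
	var out map[string]interface{}
	if err := json.Unmarshal(b, &out); err != nil {
		panic(err)
	}
	return int64(out["index"].(float64)) // rounding happens here
}

func main() {
	const applied = int64(1)<<60 + 1 // e.g. a large applied index
	fmt.Println(roundTrip(applied) == applied) // false: the +1 is rounded away
	fmt.Println(roundTrip(42) == 42)           // true: small values survive
}
```

jsonpb sidesteps this by following the proto3 JSON mapping, where int64/uint64 fields are emitted as JSON strings rather than numbers.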


pkg/cli/debug_recover_loss_of_quorum.go, line 267 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Please refactor all of these methods such that they don't rely on the global state. That is, make a little trampoline where the actual command sets up the parameters and then invokes a method that has all of the business logic (passing all of the configuration directly to it). For inspiration, see

if err != nil {
which is the business logic. The command only has the setup and plumbing as well as the confirmation prompt, but nothing that you may want to unit test. I would like the datadriven tests to not touch any of the config state in cli, i.e. operate purely on the business logic. (Haven't gotten to that part of the review yet).

+1


pkg/cli/debug_recover_loss_of_quorum.go, line 446 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

It's almost not worth quibbling about this, but as I already brought up elsewhere I'm not sure if it's worth asking for prompts on commands that simply write the desired output file. I'd trust @knz's judgment call on that but I don't think this is what we consistently do elsewhere in the cli. I do think it makes sense to avoid overwriting files but at least the ballast command simply errors out:

tobias@td:~/go/src/github.com/cockroachdb/cockroach$ ./cockroach debug ballast -z 1MiB foo.asd
tobias@td:~/go/src/github.com/cockroachdb/cockroach$ ./cockroach debug ballast -z 1MiB foo.asd
ERROR: file already exists
Failed running "debug ballast"

I don't feel strongly about this here. Just curious what the "right" thing to do is. I do like to keep it simple, and there's enough code here around this handling to have a bug.

I agree, error out and optionally provide a -f,--force flag to overwrite. If this isn't already the case, there should also be a way to opt out of interactive operation entirely (if nothing else, for testability).


pkg/cli/debug_recover_loss_of_quorum.go, line 545 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This looks like a TODO, make it a proper one. As is, this is potentially causing confusion about what the code is or is not doing.

Also, add a TODO (maybe replacing this one?) about writing to logs the fact that the recovery happened. We need to discuss what the right logs are. Ideally you want audit-type logging, but this is the cli so we don't have that luxury. We could stash a file in the store directory or write a store-local key that then translates, on node startup, into a proper log structured log entry (without removing the store key - can't hurt to remember that fact forever). I think that is the right way to do it. Definitely not for this PR, but make sure to file an issue. (cc @erikgrinaker)

Yeah, we'll have to stash this somewhere in the store. I prefer a key over a file because it's more resistant to deliberate or accidental manipulation, but a store-local key isn't ideal because the info will be lost if the node is decommissioned or the store is removed. I think we'll have to buffer it as a store-local key temporarily, and then move it into the audit log and the replicated key-space once the node starts back up.


pkg/cli/debug_recover_loss_of_quorum.go, line 567 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I've been thinking that it might be the better approach to always expect a valid plan. If the plan is invalid and the operator wants to go ahead anyway, they need to edit the plan to make it valid. Haven't fully thought this through but I like how this makes clear which usage of this tool we support and which one we don't. We support valid, CRDB-generated plans only and everything else is an exceptional situation.

+1, let's be paranoid and strict here.


pkg/cli/debug_recover_loss_of_quorum.go, line 640 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

We talked about a need to also replicaGC all the "non-designated survivors" (I hope this made it into the RFC, not sure) to ensure smooth operations (if they're still around, they are unavailable, and can hold up requests that accidentally get routed to them). Please make sure this is tracked since this PR isn't implementing it yet (and shouldn't try to). cc @erikgrinaker

Wouldn't the replicaGC queue automatically remove them once it notices that the replicas are no longer part of the descriptor? However, we should make sure that happens asap, rather than waiting for however long it takes the queue to get around to it.

@aliher1911
Contributor Author

Before we move any further: I realized how ballooned this PR has become with the error handling that was added at later stages. I can split it into smaller pieces along the collect-plan-apply boundary, at least into multiple commits, if that makes sense. I think some comment history might be lost, but I'll address all existing bits before doing the split.
What do you think?
That would also move the datadriven tests into a separate commit, which was my concern initially as it makes the commits non-self-sufficient.

Contributor

@erikgrinaker erikgrinaker left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @tbg)


pkg/cli/debug_recover_loss_of_quorum.go, line 340 at r9 (raw file):

Previously, aliher1911 (Oleg) wrote…

When would that be the case though, if we expect a replica on a store, the store is available, but the replica is missing? I think we can try that in a follow-up where we group replicas before making a decision.

I'm basically saying that I think we can drop the len(deadStoreIDs) check. The len(plan.Updates) == 0 check is currently only run when len(deadStoreIDs) == 0; I think it should apply regardless (along with the confirmation).

This would happen when a user runs this and it turns out that no ranges have actually lost quorum.


pkg/cli/debug_recover_loss_of_quorum_test.go, line 45 at r9 (raw file):

Previously, aliher1911 (Oleg) wrote…

As we now know, there's some bazel hack in our

Yeah, that seems unfortunate.


pkg/cli/debug_recover_loss_of_quorum_test.go, line 128 at r9 (raw file):

Previously, aliher1911 (Oleg) wrote…

We only test that the cli produced some sensible console output based on missing replicas, but we don't test the actual changes. I think we shouldn't test correctness of the application here, as it is covered by our datadriven tests; we should only check that the cli accepted the file and did "something", which is covered separately.
Dependency on previous steps is not ideal and would fail if collect or plan fails. The alternative would be to create the plan by hand, but that's IMHO fragile: if we change anything upstream of apply, we'll break this test since the data is out of date, while using the previous steps keeps it up to date.
We can bake it all into a single test though. Then the failure messages would distinguish the different failures.
I'm open to suggestions here.

Yes, I agree that we should test correctness in the datadriven tests, and that this should be a basic end-to-end test like it already is.

What I mean is that TestCollectInfo currently just checks that we fetched some JSON data from the stores, and TestGeneratePlan checks that a plan file was generated. It seems like if either of those things fail, then TestApplyPlan would also fail, so TestApplyPlan alone would get us basically the same test coverage as the other tests. We could therefore probably just remove the other tests.

Anyway, your call, just a thought.


pkg/kv/kvserver/loqrecovery/recovery.go, line 233 at r9 (raw file):

Previously, aliher1911 (Oleg) wrote…

Looks like I need to suppress gogoproto stringer and add custom for it to work by value.

Ah, my bad -- I keep forgetting that value types can't implicitly cast to pointer receivers when they're unaddressable. We totally shouldn't do anything custom here.


pkg/kv/kvserver/loqrecovery/recovery.go, line 595 at r9 (raw file):

Previously, aliher1911 (Oleg) wrote…

Retry here means you can relaunch the binary with apply plan. If you have 3 stores and store 2 failed because of something, you then retry it; the first store is already processed, so we will have no updates.

Ah, got it -- makes sense.

Contributor Author

@aliher1911 aliher1911 left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @tbg)


pkg/cli/debug_recover_loss_of_quorum_test.go, line 128 at r9 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Yes, I agree that we should test correctness in the datadriven tests, and that this should be a basic end-to-end test like it already is.

What I mean is that TestCollectInfo currently just checks that we fetched some JSON data from the stores, and TestGeneratePlan checks that a plan file was generated. It seems like if either of those things fail, then TestApplyPlan would also fail, so TestApplyPlan alone would get us basically the same test coverage as the other tests. We could therefore probably just remove the other tests.

Anyway, your call, just a thought.

I'm now leaning towards making it a single test and then adding structured event logging to it in a following diff. @tbg maybe you have some bright thoughts about the end-to-end workflow test for a sequence of cli commands? Currently I see 2 options:

  • test per operation, but verify the last one in the chain. pros: tests are per feature; cons: fails if an earlier step fails (but we can make it succeed if a prerequisite is not successful), and runtime is long as we need a 3-node disk-based cluster with replication
  • single test. pros: less duplication, runs faster; cons: long test with a bit of logic, akin to roachtests.

pkg/kv/kvserver/loqrecovery/recovery.go, line 233 at r9 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Ah, my bad -- I keep forgetting that value types can't implicitly cast to pointer receivers when they're unaddressable. We totally shouldn't do anything custom here.

We actually override it a lot for redactability/safety reasons, so lots of those structs have Stringer defined on the value rather than the pointer. I already changed that; it makes it look more consistent with the rest of the structs we have.

@tbg tbg requested a review from erikgrinaker December 10, 2021 22:11
Member

@tbg tbg left a comment


Revisited all my waiting comments and did a pass through recovery.go. I would need to do another pass to really dig into the logic, I found that I was constantly semi-confused about what I was currently looking at, which I chalk up to Friday night brain but also naming and comments that I had trouble absorbing. I think the pragmatic approach is to revisit all of this in one of the follow-up PRs and to focus in this PR on the cli UX and on the test corpus. I haven't looked through the datadriven tests myself yet, but I saw a comment of Erik's where he suspected there were a number of scenarios we aren't covering yet. Adding those in (stopping short of those that require specialized states, like intents to abort or mismatching applied indexes etc, we can save those for the PRs that handle them) might be the last thing to do here before the merge, but it's probably even better to add these in in a follow-up PR. I'll defer to @erikgrinaker here.
Hope to try this out next week, pretty exciting!

Reviewed 1 of 11 files at r7, 1 of 13 files at r10.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @tbg)


pkg/cli/debug_recover_loss_of_quorum.go, line 545 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Yeah, we'll have to stash this somewhere in the store. I prefer a key over a file because it's more resistant to deliberate or accidental manipulation, but a store-local key isn't ideal because the info will be lost if the node is decommissioned or the store is removed. I think we'll have to buffer it as a store-local key temporarily, and then move it into the audit log and the replicated key-space once the node starts back up.

Lost track, was this done or filed?


pkg/cli/debug_recover_loss_of_quorum.go, line 640 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Wouldn't the replicaGC queue automatically remove them once it notices that the replicas are no longer part of the descriptor? However, we should make sure that happens asap, rather than waiting for however long it takes the queue to get around to it.

The replicaGCQueue does not operate at the scale of user latencies. If a stuck replica is around after the recovery, it may stick around for days if we're really unlucky, and requests routed to it may time out. Also, through the lens of stability it just seems better to minimize the possible interactions between the "old" and "new" replicas. And for a third point, we want the metrics to go back to healthy the moment the recovery ends, and not "sometime in the future". We don't have to do it now, but is this tracked somewhere?


pkg/cli/debug_recover_loss_of_quorum.go, line 672 at r2 (raw file):

Previously, aliher1911 (Oleg) wrote…

I contemplated CPut a bit, but if we want to make it work, we'll have to preserve the full descriptor all the way to the planner and back. While doable, it would be more trouble than just checking the existing state by hand. Shouldn't checking Generation be enough in general, if it is passed back with the update?

Checking Generation should be enough in practice, I would be fine with that. Doesn't have to happen now but make a note of it please.


pkg/cli/debug_recover_loss_of_quorum_test.go, line 50 at r2 (raw file):

Previously, aliher1911 (Oleg) wrote…

Moved all structs together with datadriven handlers into separate file. Does it matter that it is now a _test.go file but without actual tests?

No, totally fine.


pkg/cli/debug_recover_loss_of_quorum_test.go, line 128 at r9 (raw file):

Previously, aliher1911 (Oleg) wrote…

I'm now leaning towards making it a single test and then adding structured event logging to it in a following diff. @tbg maybe you have some bright thoughts about the end-to-end workflow test for a sequence of cli commands? Currently I see 2 options:

  • test per operation, but verify the last one in the chain. pros: tests are per feature; cons: fails if an earlier step fails (but we can make it succeed if a prerequisite is not successful), and runtime is long as we need a 3-node disk-based cluster with replication
  • single test. pros: less duplication, runs faster; cons: long test with a bit of logic, akin to roachtests.

I'm not sure I understand your last suggestion, Oleg. Erik seems to say "just remove TestGeneratePlan and TestCollectInfo". This seems to make sense to me, since TestApplyPlan covers their ground and then some (once you also incorporate the two or three require.XYZs from the ends of the two tests into TestApplyPlan).

If you are entertaining structured logging, I think the easiest way to do it might be to write a JSON stream to a (flag-specified and defaulting to /dev/null) file in apply-plan, but I'm not sure it's worth the effort at this point.

Rather, what's missing is that we actually get the cluster back. If our rewritten range descriptors trivially triggered a panic on node restart, would we catch this anywhere (haven't reviewed the entire diff, came straight to this comment)? We have TestRemoveDeadReplicas which end-to-end-tests ./cockroach debug remove-dead-replicas; it seems like you could copy a lot of the ideas there verbatim to make a more powerful end-to-end test. In fact, since your work is also (in the long run) replacing remove-dead-replicas, we would expect that source test to vanish someday.


pkg/kv/kvserver/loqrecovery/recovery.go, line 229 at r9 (raw file):

I don't think that should be possible once we pick the store with the highest commit index

It will still be possible then. We would need to pick the store with the highest AppliedIndex. And even then, we might find that the true descriptor is visible on meta2 and has a replica on a store that didn't know about this yet. This would be a learner, and I think we'd get away with it (the voter wouldn't have the learner in its config, so raft wouldn't immediately crash), but it's a mess.

The simple fix here would be to add 10.
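The suggested fix can be sketched with simplified stand-in types (the real code uses roachpb.RangeDescriptor, and the margin constant here is only illustrative of the "add 10" idea):

```go
package main

import "fmt"

// rangeDescriptor is a stand-in for roachpb.RangeDescriptor, reduced to
// the fields relevant to this sketch.
type rangeDescriptor struct {
	RangeID       int
	NextReplicaID int
}

// nextReplicaIDMargin mirrors the "just add 10" suggestion: leave a gap
// so a stale learner or an unseen descriptor update cannot reuse an ID
// that the recovered descriptor also hands out.
const nextReplicaIDMargin = 10

func plannedNextReplicaID(desc rangeDescriptor) int {
	return desc.NextReplicaID + nextReplicaIDMargin
}

func main() {
	desc := rangeDescriptor{RangeID: 7, NextReplicaID: 5}
	fmt.Println(plannedNextReplicaID(desc)) // 15
}
```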


pkg/kv/kvserver/loqrecovery/recovery.go, line 1 at r11 (raw file):

// Copyright 2021 The Cockroach Authors.

I think it might be helpful to split this into files according to the plan | apply stages. There's also a few random things (UpdatableStorage for example) that might be better kept on the side in a utils.go.


pkg/kv/kvserver/loqrecovery/recovery.go, line 49 at r11 (raw file):

			return loqrecoverypb.NodeReplicaInfo{}, err
		}
		err = kvserver.IterateRangeDescriptorsFromDisk(ctx, db, func(desc roachpb.RangeDescriptor) error {

mini nit: FromDisk -> FromStorage?
nit I feel a bit more strongly about: the canonical name for the Reader parameter is reader (or basically anything but db, which is usually the KV DB or the SQL DB).


pkg/kv/kvserver/loqrecovery/recovery.go, line 49 at r11 (raw file):

			return loqrecoverypb.NodeReplicaInfo{}, err
		}
		err = kvserver.IterateRangeDescriptorsFromDisk(ctx, db, func(desc roachpb.RangeDescriptor) error {

if err := ...; err != nil { return err }


pkg/kv/kvserver/loqrecovery/recovery.go, line 78 at r11 (raw file):

}

type updateLocationsMap map[roachpb.NodeID]storeIDSet

Comment here would be nice. As someone new to this code updateLocationsMap tells me nothing.


pkg/kv/kvserver/loqrecovery/recovery.go, line 90 at r11 (raw file):

}

func (m updateLocationsMap) AsMapOfLists() map[roachpb.NodeID][]roachpb.StoreID {

nit: []roachpb.StoreID is a slice, not a list.


pkg/kv/kvserver/loqrecovery/recovery.go, line 102 at r11 (raw file):

type PlanningReport struct {
	// Totals for the planning process.
	TotalReplicas    int

Can you add comments on these? Is TotalReplicas = HealthyReplicas plus any replicas that are members of ranges that have lost quorum? Or also LiveNonSurvivors? So if I have a single range, 5x replicated but three replicas are on dead nodes, but one of the nodes is actually live (surprise), do we get Total=3 Healthy = 0 LiveNonSurvivors=1?

Also, feels weird to use the word "Live" here, I would generally avoid this since "liveness" is a running-cluster concept and here we're only looking at disks. "Present" maybe?


pkg/kv/kvserver/loqrecovery/recovery.go, line 108 at r11 (raw file):

	// Deduced store availability info.
	AvailableStores []roachpb.StoreID

"Available" -> Present to make sure there's no confusion with the concept of "availability" in the LOQ sense?


pkg/kv/kvserver/loqrecovery/recovery.go, line 109 at r11 (raw file):

	// Deduced store availability info.
	AvailableStores []roachpb.StoreID
	MissingStores   []roachpb.StoreID

Here too I'm wondering what "missing" means. Marked as dead? Or stores we expected to see (based on descriptors we've seen) but didn't? I think the latter, but I'm only guessing.


pkg/kv/kvserver/loqrecovery/recovery.go, line 112 at r11 (raw file):

	// Detailed update info.
	PlannedUpdates []PlannedReplicaReport

Isn't this RangePlan? Looking at the struct (which could use field comments btw) it sounds like it's determining for the given Range which Replica to keep and which ones to discard. So it is a plan to recover this range, not a plan that only affects a single replica?


pkg/kv/kvserver/loqrecovery/recovery.go, line 117 at r11 (raw file):

PlannedReplicaReport contains detailed info about changes planned for particular replicas.

Sorry to nitpick on the comment, but when (aeons ago) I said

A fair amount of them [the comments] that exist are sort of saying the obvious thing without explaining the greater scheme of things (the comments were telling me what the next line does, rather than telling me why).

and didn't have an example, this is one - I'm not really any wiser after reading this line and so we might as well remove it and replace it with something more concise such as (sorry, can't resist sneaking a type rename in there)

RangeRecoveryPlan describes how quorum will be restored to a Range by naming the StoreID whose Replica will be reconfigured as the sole voter, as well as the other Replicas (which will be discarded to prevent undesired interactions).


pkg/kv/kvserver/loqrecovery/recovery.go, line 240 at r11 (raw file):

			NextReplicaID: desc.NextReplicaID + 1,
		}
		log.Infof(ctx, "Replica has lost quorum, recovering: %s -> %s", desc, update)

I'm sure Erik and you have been over this, and I don't want to open the can of worms again, but it seems odd to me that this library method logs.


pkg/kv/kvserver/loqrecovery/recovery.go, line 259 at r11 (raw file):

}

// validateReplicaSets evaluates provided set of replicas and an optional deadStoreIDs

"optional" just means that it could be empty, right? It raises the question of what you get if you call it without specifying the dead stores. The comment indicates that if you don't pass the dead stores, you'll get an error back. So what would be the point?


pkg/kv/kvserver/loqrecovery/recovery.go, line 261 at r11 (raw file):

// validateReplicaSets evaluates provided set of replicas and an optional deadStoreIDs
// request and produces consistency info containing:
// availableStores  - all storeIDs for which info was collected, i.e. (barring operator

nice comment!


pkg/kv/kvserver/loqrecovery/recovery.go, line 316 at r11 (raw file):
This comment should say what it marshals to.

Marshal json-encodes the provided message to the Writer.


pkg/kv/kvserver/loqrecovery/recovery.go, line 317 at r11 (raw file):

// MarshalAndWriteResults serializes plan and write into file or stdout.
func MarshalAndWriteResults(results protoutil.Message, writer io.Writer) error {

nit: the canonical (following vendor/github.com/gogo/protobuf/jsonpb/jsonpb.go) def here would be func Marshal(out io.Writer, pb protoutil.Message) error { // ... }

I get confused by all of the different marshalers we have in our codebase so take that with a grain of salt.

You may also entertain the idea of having MarshalJSON() and UnmarshalJSON() methods on the structs themselves (with the impl shared). It would be a little less confusing that way, right now we're exposing a ... generic marshaling function in this package that folks might be tempted to use, for completely unrelated purposes, instead of going through protoutil.


pkg/kv/kvserver/loqrecovery/recovery.go, line 330 at r11 (raw file):

}

// PrepareStoreReport contains information about all prepared changes for the

PreparedStoreReport

same nit about the word "Report" when it's really a Plan.

I would've called this StorePlan, might be missing something obvious here.


pkg/kv/kvserver/loqrecovery/recovery.go, line 334 at r11 (raw file):

type PrepareStoreReport struct {
	// MissingStores contains a set of stores that node on which this update is performed
	// is expected to have according to update plan, but are not provided in config.

This means the plan will not be able to be applied, right? Maybe point that out.

I see that we're then erroring out here:

if len(prepReport.MissingStores) > 0 {
return errors.Newf("stores %s expected on the node but no paths were provided",
joinStoreIDs(prepReport.MissingStores))
}

I would've expected this check to live in the library part, i.e. here, for example via (p *StorePlan) Err() error, and taking another step back I'm unsure why this doesn't instead already immediately result in an error returned from PrepareUpdateReplicas (btw weird that PrepareUpdateReplicas returns a Store-related struct, I think it should just be MakeStorePlan?)
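The shape suggested here might look like the following minimal sketch (names and types hypothetical, simplified from the PR): the library constructs the per-store plan and reports unappliable states through an Err method, so the cli layer only decides how to present the failure.

```go
package main

import "fmt"

// storePlan is a hypothetical stand-in for the per-store plan.
type storePlan struct {
	// MissingStores are stores the plan expects on this node but for
	// which no store paths were provided.
	MissingStores []int
}

// Err reports whether the plan can be applied at all, keeping the check
// in the library rather than in the cli command.
func (p *storePlan) Err() error {
	if len(p.MissingStores) > 0 {
		return fmt.Errorf("stores %v expected on the node but no paths were provided", p.MissingStores)
	}
	return nil
}

func main() {
	ok := &storePlan{}
	broken := &storePlan{MissingStores: []int{2, 5}}
	fmt.Println(ok.Err() == nil)     // a complete plan passes
	fmt.Println(broken.Err() != nil) // missing stores surface as an error
}
```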


pkg/kv/kvserver/loqrecovery/recovery.go, line 346 at r11 (raw file):

// to lazily create batches and perform lifecycle management of the store
// and the batch.
type UpdatableStore struct {

I think if we polished this a bit we'd arrive at something simpler. Not sure it's a good idea at this point though. This will do.


pkg/kv/kvserver/loqrecovery/recovery.go, line 392 at r11 (raw file):

	OldReplicaID roachpb.ReplicaID

	// Update status.

This comment isn't particularly helpful. How is it already updated? Is this idempotency stuff (in case someone runs the update twice, the second report will have this set)? Mention that if that's it.


pkg/kv/kvserver/loqrecovery/recovery.go, line 395 at r11 (raw file):

	AlreadyUpdated bool

	// Update details.

Is this comment useful?


pkg/kv/kvserver/loqrecovery/recovery.go, line 405 at r11 (raw file):

// PrepareUpdateReplicas creates batch updates from plan to provided stores.
// All found discrepancies are collected into list of warnings.

I don't see a list of warnings?


pkg/cli/testdata/quorum/learners_lose, line 44 at r2 (raw file):

Previously, aliher1911 (Oleg) wrote…

Current test prints yaml of actual plan protobufs here. It also reuses ReplicaDescriptor specs. What you are saying is we could actually drop store/node id from top level plan struct and always pull that from the replica desc?

Without being really deeply tuned into this PR, I think yes, that is generally how I would tend to do it - whenever duplication happens in a struct it is a source of bugs or at least confusion. So I would gravitate towards getters when no error handling is necessary. For example, a recovery decision for a range has to contain a replica, so a NodeID() getter could just read x.SurvivingReplicaDescriptor.NodeID.
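The getter approach described above might look like this minimal sketch (types simplified; the real descriptors live in roachpb and the plan protos in loqrecoverypb):

```go
package main

import "fmt"

// replicaDescriptor is a stand-in for roachpb.ReplicaDescriptor.
type replicaDescriptor struct {
	NodeID, StoreID, ReplicaID int
}

// replicaUpdate keeps the surviving replica descriptor as the single
// source of truth; no duplicated top-level NodeID/StoreID fields that
// could drift out of sync.
type replicaUpdate struct {
	SurvivingReplica replicaDescriptor
}

func (u replicaUpdate) NodeID() int  { return u.SurvivingReplica.NodeID }
func (u replicaUpdate) StoreID() int { return u.SurvivingReplica.StoreID }

func main() {
	u := replicaUpdate{SurvivingReplica: replicaDescriptor{NodeID: 1, StoreID: 1, ReplicaID: 3}}
	fmt.Println(u.NodeID(), u.StoreID()) // read through the getters
}
```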

@tbg tbg self-requested a review December 10, 2021 22:12
Contributor Author

@aliher1911 aliher1911 left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @tbg)


pkg/cli/debug_recover_loss_of_quorum_test.go, line 128 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I'm not sure I understand your last suggestion, Oleg. Erik seems to say "just remove TestGeneratePlan and TestCollectInfo". This seems to make sense to me, since TestApplyPlan covers their ground and then some (once you also incorporate the two or three require.XYZs from the ends of the two tests into TestApplyPlan).

If you are entertaining structured logging, I think the easiest way to do it might be to write a JSON stream to a (flag-specified and defaulting to /dev/null) file in apply-plan, but I'm not sure it's worth the effort at this point.

Rather, what's missing is that we actually get the cluster back. If our rewritten range descriptors trivially triggered a panic on node restart, would we catch this anywhere (haven't reviewed the entire diff, came straight to this comment)? We have TestRemoveDeadReplicas which end-to-end-tests ./cockroach debug remove-dead-replicas; it seems like you could copy a lot of the ideas there verbatim to make a more powerful end-to-end test. In fact, since your work is also (in the long run) replacing remove-dead-replicas, we would expect that source test to vanish someday.

Looking at TestRemoveDeadReplicas, it seems to cover everything from the command line down to various cases of "survivability" of data. In our case we have tests that check selection and update using generated data, which is fast and easily extendable, but they don't assert whether the cluster can actually recover after a restart.
What I found about the existing test is that if recovery doesn't happen (e.g. we just lose quorum across the board), the test will just hang because the liveness ranges are broken and the cluster start never finishes. So we only test some corner case where the majority of ranges recover somehow and we then don't have upreplication for some particular range.


pkg/kv/kvserver/loqrecovery/recovery_test.go, line 23 at r9 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Will we be writing additional test cases for this? I haven't gone through the exercise of actually enumerating them, but there seems to be lots of other possible scenarios here, including futile ones where we want to make sure that we fail safely.

With respect to this PR, it covers the cases that are supported. Once we have the range check and also committed-index analysis, they need to be added as well. Currently we test what the original recovery supported.


pkg/kv/kvserver/loqrecovery/recovery.go, line 49 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

mini nit: FromDisk -> FromStorage?
nit I feel a bit more strongly about: the canonical name for the Reader parameter is reader (or basically anything but db, which is usually the KV DB or the SQL DB).

I don't want to rename kvserver functions as part of this PR, tbh. FromStorage is coming from our store in there.


pkg/kv/kvserver/loqrecovery/recovery.go, line 102 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Can you add comments on these? Is TotalReplicas = HealthyReplicas plus any replicas that are members of ranges that have lost quorum? Or also LiveNonSurvivors? So if I have a single range, 5x replicated but three replicas are on dead nodes, but one of the nodes is actually live (surprise), do we get Total=3 Healthy = 0 LiveNonSurvivors=1?

Also, feels weird to use the word "Live" here, I would generally avoid this since "liveness" is a running-cluster concept and here we're only looking at disks. "Present" maybe?

Rather Total=2, Healthy=0, LiveNonSurvivors=1; but Total and Healthy would also include ranges that we don't touch because they don't need recovery.


pkg/kv/kvserver/loqrecovery/recovery.go, line 112 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Isn't this RangePlan? Looking at the struct (which could use field comments btw) it sounds like it's determining for the given Range which Replica to keep and which ones to discard. So it is a plan to recover this range, not a plan that only affects a single replica?

This is only meant to be printed. It replaced the previous []string report that was collected during planning to present to the user for confirmation. It now captures the data in a struct and lets the cli do the printing in whatever form is deemed necessary.


pkg/kv/kvserver/loqrecovery/recovery.go, line 117 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…
PlannedReplicaReport contains detailed info about changes planned for particular replicas.

Sorry to nitpick on the comment, but when (aeons ago) I said

A fair amount of them [the comments] that exist are sort of saying the obvious thing without explaining the greater scheme of things (the comments were telling me what the next line does, rather than telling me why).

and didn't have an example, this is one - I'm not really any wiser after reading this line and so we might as well remove it and replace it with something more concise such as (sorry, can't resist sneaking a type rename in there)

RangeRecoveryPlan describes how quorum will be restored to a Range by naming the StoreID whose Replica will be reconfigured as the sole voter, as well as the other Replicas (which will be discarded to prevent undesired interactions).

That's not really a plan; I'll try to expand the comment to make it clearer.


pkg/kv/kvserver/loqrecovery/recovery.go, line 240 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I'm sure Erik and you have been over this, and I don't want to open the can of worms again, but it seems odd to me that this library method logs.

It doesn't leak out anywhere when running it, but it makes debugging much easier in tests and standalone runs, where you can just point the log at a file and check the details. I found it helpful, but if that's completely off I can strip them all and keep them in local stashes or something.


pkg/kv/kvserver/loqrecovery/recovery.go, line 259 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

"optional" just means that it could be empty, right? It raises the question of what you get if you call it without specifying the dead stores. The comment indicates that if you don't pass the dead stores, you'll get an error back. So what would be the point?

If you don't pass anything then it would just deduce dead stores, but if you pass deadStores then it must match what is found exactly or it will err.
At the higher level, the CLI will ask for confirmation if you didn't provide dead stores, but won't if what you provided matches what is found.


pkg/kv/kvserver/loqrecovery/recovery.go, line 316 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This comment should say what it marshals to.

Marshal json-encodes the provided message to the Writer.

It makes more sense to keep that inside the CLI itself, as we don't marshal in the recovery code. Or just copy-paste it inline, as that seems less controversial.


pkg/kv/kvserver/loqrecovery/recovery.go, line 317 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

nit: the canonical (following vendor/github.com/gogo/protobuf/jsonpb/jsonpb.go) def here would be func Marshal(out io.Writer, pb protoutil.Message) error { // ... }

I get confused by all of the different marshalers we have in our codebase so take that with a grain of salt.

You may also entertain the idea of having MarshalJSON() and UnmarshalJSON() methods on the structs themselves (with the impl shared). It would be a little less confusing that way, right now we're exposing a ... generic marshaling function in this package that folks might be tempted to use, for completely unrelated purposes, instead of going through protoutil.

I'll just get rid of it. I think it shrunk too much from the original design.


pkg/kv/kvserver/loqrecovery/recovery.go, line 330 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

PreparedStoreReport

same nit about the word "Report" when it's really a Plan.

I would've called this StorePlan, might be missing something obvious here.

Well, it is really a report. The plan is a proto message that can be marshalled and passed around, while this is just a glorified string.


pkg/kv/kvserver/loqrecovery/recovery.go, line 334 at r11 (raw file):

PrepareUpdateReplicas
creates batches with changes per store, but doesn't commit them. After all the batches are successfully prepared, the user is asked to confirm the destructive action. This is also where we ask the user to confirm any discrepancy between expected stores and provided stores.
I don't think we should immediately err if not all stores were provided here. It may come down to how stores are mounted when we run k8s operators to do recovery, which is not necessarily the same as the normal case. I think it gives us some flexibility in how to run recovery.
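The prepare-then-confirm-then-commit flow described here can be sketched with toy types (the real code uses pebble batches behind storage.ReadWriter; everything below is illustrative):

```go
package main

import "fmt"

// batch is a toy stand-in for a per-store storage batch.
type batch struct {
	storeID int
	writes  []string
	applied bool
}

func (b *batch) put(k string) { b.writes = append(b.writes, k) }
func (b *batch) commit()      { b.applied = true }

// prepare stages all changes per store without committing, so the user
// can be asked to confirm before any destructive write lands.
func prepare(changes map[int][]string) map[int]*batch {
	out := make(map[int]*batch)
	for storeID, keys := range changes {
		b := &batch{storeID: storeID}
		for _, k := range keys {
			b.put(k)
		}
		out[storeID] = b
	}
	return out
}

// commitAll applies every prepared batch; called only after confirmation.
func commitAll(batches map[int]*batch) {
	for _, b := range batches {
		b.commit()
	}
}

func main() {
	batches := prepare(map[int][]string{1: {"r67-desc"}, 4: {"r95-desc"}})
	fmt.Println("prepared batches:", len(batches)) // nothing written yet
	commitAll(batches)
	fmt.Println("all applied:", batches[1].applied && batches[4].applied)
}
```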


@erikgrinaker erikgrinaker left a comment


Great, I think we're getting there.

In the make-plan report, consider adding a summary line for "Discarded dead replicas" too, for completeness. Also, should we consider including the dead replicas in the total replica count too, or add a separate line for "Total replicas known"? As a user, I might be confused that things don't add up -- here's an example, where I might wonder why 149+16 != 157:

Total replicas analyzed: 157
Healthy replicas found:  149
Ranges without quorum:   8
Discarded live replicas: 0

Recovering range r67:/Table/55/1/-9222246136947044152 promoting replica r4 at store s1 to r5. Discarding replicas: (n5,s5):3,(n2,s2):2
Recovering range r68:/Table/55/1/-9222246136946980632 promoting replica r1 at store s1 to r4. Discarding replicas: (n5,s5):3,(n2,s2):2
Recovering range r95:/Table/55/1/-9222246136947431320 promoting replica r1 at store s4 to r5. Discarding replicas: (n5,s5):4,(n2,s2):2
Recovering range r65:/Table/55/1/-9222246136947367800 promoting replica r1 at store s4 to r4. Discarding replicas: (n5,s5):3,(n2,s2):2
Recovering range r96:/Table/55/1/-9222246136947302264 promoting replica r1 at store s4 to r4. Discarding replicas: (n5,s5):3,(n2,s2):2
Recovering range r97:/Table/55/1/-9222246136947238744 promoting replica r1 at store s4 to r4. Discarding replicas: (n5,s5):3,(n2,s2):2
Recovering range r66:/Table/55/1/-9222246136947173208 promoting replica r1 at store s4 to r4. Discarding replicas: (n5,s5):3,(n2,s2):2
Recovering range r85:/Table/55/1/-9222246136947109688 promoting replica r1 at store s4 to r4. Discarding replicas: (n5,s5):3,(n2,s2):2

Maybe something like this might be better (same dataset):

Total replicas known: 267
Live replicas analyzed: 157
Discarding dead replicas: 110
Discarding live replicas: 0
Ranges without quorum: 8
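The arithmetic behind Erik's suggested summary (total known = live analyzed + dead discarded, so the lines add up) can be sketched as follows; the replica type and field names are hypothetical:

```go
package main

import "fmt"

// replica is a toy record: live marks replicas found in the collected
// info files, needsRecovery marks live replicas of ranges without quorum.
type replica struct {
	live          bool
	needsRecovery bool
}

// summarize computes counters such that total == live + dead, avoiding
// the "149+16 != 157" confusion from the original report.
func summarize(replicas []replica) (total, live, dead, unhealthy int) {
	for _, r := range replicas {
		total++
		if r.live {
			live++
			if r.needsRecovery {
				unhealthy++
			}
		} else {
			dead++
		}
	}
	return
}

func main() {
	rs := []replica{{live: true}, {live: true, needsRecovery: true}, {live: false}}
	total, live, dead, unhealthy := summarize(rs)
	fmt.Printf("Total replicas known: %d\nLive replicas analyzed: %d\nDiscarding dead replicas: %d\nLive replicas needing recovery: %d\n",
		total, live, dead, unhealthy)
}
```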

Also, the make-plan output says "promoting replica r1 at store s1 to r4", but apply-plan calls this replica 14, not 4.

Finally, there are a fair bit of typos and grammar issues in the CLI help text and such. That isn't a big deal in internal comments and such, but I think we should avoid it in user-facing text. Do you mind doing a pass over this? I'd be happy to do a pass as well, if you'd like.

I haven't looked through the datadriven tests myself yet, but I saw a comment of Erik's where he suspected there were a number of scenarios we aren't covering yet. Adding those in (stopping short of those that require specialized states, like intents to abort or mismatching applied indexes etc, we can save those for the PRs that handle them) might be the last thing to do here before the merge, but it's probably even better to add these in in a follow-up PR. I'll defer to @erikgrinaker here.

Yeah, let's write additional tests later. Makes sense to defer that until the recovery logic is finalized.

Reviewed 3 of 13 files at r10, 11 of 13 files at r12, 1 of 1 files at r13, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @tbg)


pkg/cli/debug_recover_loss_of_quorum_test.go, line 176 at r13 (raw file):

	).Connect(ctx)
	if err != nil {
		t.Fatal(err)

nit: why not require.NoError throughout?


pkg/kv/kvserver/loqrecovery/recovery_test.go, line 23 at r9 (raw file):

Previously, aliher1911 (Oleg) wrote…

With respect to this PR, it covers cases that are supported. Once we have range check and also committed index analyzed they need to be added as well. Currently we test what original recovery supported.

Yeah, that's fine, we can add cases later. There's still many cases we should test, even with this initial logic. As a couple of basic examples: what happens if there's only a learner replica left? What happens if one replica underwent a split but another didn't? There's lots of scenarios like these.


pkg/kv/kvserver/loqrecovery/utils.go, line 50 at r13 (raw file):

// to lazily create batches and perform lifecycle management of the store
// and the batch.
type UpdatableStore struct {

It's unclear to me why this exists. It's only instantiated and used once, for PrepareUpdateReplicas and CommitReplicaChanges, but those functions could just as well take map[roachpb.StoreID]storage.ReadWriter containing batches. What does this type add that we can't already do with existing types?


pkg/kv/kvserver/loqrecovery/recovery.go, line 229 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I don't think that should be possible once we pick the store with the highest commit index

It will still be possible then. We would need to pick the store with the highest AppliedIndex. And even then, we might find that the true descriptor is visible on meta2 and has a replica on a store that didn't know about this yet. This would be a learner, and I think we'd get away with it (the voter wouldn't have the learner in its config, so raft wouldn't immediately crash), but it's a mess.

The simple fix here would be to add 10.

Right. How about collecting all of the canonical range descriptors from meta2, and if it has quorum, use those to pick the new replica IDs? Adding 10 seems like an ok fallback.
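The "add 10" fallback discussed here amounts to something like the following sketch (function and parameter names are illustrative; 10 is the safety gap from the thread, not a proven bound):

```go
package main

import "fmt"

// nextReplicaID picks the replica ID for a promoted survivor: the highest
// NextReplicaID among the surviving replicas' descriptors, plus a gap of
// 10 to stay above any ID the dead replicas may already have handed out.
func nextReplicaID(survivorNextIDs []int) int {
	highest := 0
	for _, id := range survivorNextIDs {
		if id > highest {
			highest = id
		}
	}
	return highest + 10
}

func main() {
	// Three survivors with NextReplicaID 4, 5, and 3 respectively.
	fmt.Println(nextReplicaID([]int{4, 5, 3}))
}
```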


@aliher1911 aliher1911 left a comment


I'm not sure how to calculate total replicas known without collecting all the data about all replicas in temporary maps. I'll drop statistics altogether except for a basic counter.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @tbg)


pkg/cli/debug_recover_loss_of_quorum.go, line 545 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Lost track, was this done or filed?

Added in #73785


pkg/kv/kvserver/loqrecovery/recovery_test.go, line 23 at r9 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Yeah, that's fine, we can add cases later. There's still many cases we should test, even with this initial logic. As a couple of basic examples: what happens if there's only a learner replica left? What happens if one replica underwent a split but another didn't? There's lots of scenarios like these.

We'll get various inconsistent states. Do we want to record that behaviour in tests as expectations? I can do that. I'm not sure what the benefit is if they have to be dropped later once we have a better solution.


pkg/kv/kvserver/loqrecovery/recovery.go, line 229 at r9 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Right. How about collecting all of the canonical range descriptors from meta2, and if it has quorum, use those to pick the new replica IDs? Adding 10 seems like an ok fallback.

In #73756 it takes the highest nextReplicaID of all surviving replicas as the next ID.

@tbg

tbg commented Dec 20, 2021

Did a runthrough. Cool stuff!

In make-plan, when I was providing the dead stores via the flag, I still get the output

Discovered dead stores from provided files: s5, s6

which is now slightly confusing since I also provided them. Perhaps this should say

Discovered dead stores from provided files: s5, s6 (this matches --dead-store-ids)

When I provide an additional bogus store that is not in the system as dead:

./cockroach debug recover make-plan --dead-store-ids 5,6,123 *.info.json > foo.json

This isn't pointed out, should this cause an error? I suppose it shouldn't; there could be situations in which a store died but because of constraints or dumb luck it isn't actually a member anywhere in the surviving system. So maybe this is working as intended. When I provide a live store as dead, the tool correctly refuses to make a plan.

Another comment about

Discovered dead stores from provided files: s5, s6

In the case in which this is auto-detected, it might make sense to augment the following prompt with a request to verify that these are indeed the stores that the user thinks are dead forever. As is, it says Proceed with plan creation [y/N] but it's not really clear if the user should do anything and will probably blindly hit y.

After confirming, this output

Plan created
Run apply-plan on node n1, store(s) s1
Run apply-plan on node n2, store(s) s2
Run apply-plan on node n3, store(s) s3
Run apply-plan on node n4, store(s) s4

Is perhaps slightly vague? These are the follow-up instructions, but I'm not 100% sure this is how readers will understand it. Perhaps this would be better:

Plan created
To complete recovery, distribute the plan to the below nodes and invoke `apply-plan`:
- node n1, store(s) s1
- ...

When I don't provide the -s flag to apply-plan, I get misleading output:

$ roachprod ssh tobias-tpcc:1-4 -- ./cockroach debug recover apply-plan plan.json
tobias-tpcc: ./cockroach debug recover a... 4/4
   1: No updates planned on this node.
   2: No updates planned on this node.
   3: No updates planned on this node.
   4: No updates planned on this node.

This should really tell me that I need to point it at the store.

roachprod ssh tobias-tpcc:1-4 -- ./cockroach debug recover apply-plan -p y -s /mnt/data1/cockroach plan.json

worked.

The messages could be a wee bit more helpful:

Replica 3 for range 251:/Table/59/1/105/55550 will be promoted to 14 with peer replica(s) removed: (n5,s5):1,(n6,s6):2

It would be nice to know which store Replica 3 is on. Perhaps just being consistent with the formatting would help:

Replica (n3,s3):3 for range 251:/Table/59/1/105/55550 will be promoted to 14 with peer replica(s) removed: (n5,s5):1,(n6,s6):2

I have no idea what "promoted to 14" means, so that should probably be fixed as well. Does this mean (n3,s3):14?
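The consistent formatting Tobias asks for follows the standard (n?,s?):? replica-descriptor notation; a toy stand-in for roachpb.ReplicaDescriptor's String method:

```go
package main

import "fmt"

// replicaDesc mirrors the (n?,s?):? notation CockroachDB uses when
// printing replica descriptors.
type replicaDesc struct {
	nodeID, storeID, replicaID int
}

func (r replicaDesc) String() string {
	return fmt.Sprintf("(n%d,s%d):%d", r.nodeID, r.storeID, r.replicaID)
}

func main() {
	old := replicaDesc{3, 3, 3}
	promoted := replicaDesc{3, 3, 14}
	// Disambiguates "promoted to 14": same node and store, new replica ID.
	fmt.Printf("Replica %s for range r251 will be promoted to %s\n", old, promoted)
}
```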

I'd encourage you to react to my feedback above in a separate PR (and I think Erik also has a pass incoming, feel free to wait for that).

Also, as a last point, when I restarted the surviving nodes, n2 quickly crashed:

F211220 14:45:58.659335 1620 kv/kvserver/store_remove_replica.go:98 ⋮ [n2,replicaGC,s2,r396/14:‹/Table/59/1/{79/470…-81/3892}›] 167  replica descriptor's ID has changed (14 >= 5)

(from the replicaGCQueue). I think this might've been on me; I messed up the restarting and briefly started some of the "dead" stores again too. But we definitely want a roachtest like that, which should have an issue filed as well. Documenting my steps below, they should translate pretty straightforwardly into a roachtest.

roachprod create -n 6 tobias-tpcc
roachprod put tobias-tpcc cockroach-linux-2.6.32-gnu-amd64 cockroach
roachprod start tobias-tpcc
roachprod run tobias-tpcc:1 -- ./cockroach workload fixtures import tpcc --warehouses 300 --checks=false
roachprod run tobias-tpcc:6 ./cockroach workload run -- tpcc --warehouses=300 --duration 300s {pgurl:1-6}
roachprod wipe tobias-tpcc:5,6
roachprod stop tobias-tpcc:1-4
roachprod ssh tobias-tpcc:1-4 -- ./cockroach debug recover collect-info -s /mnt/data1/cockroach info.json
roachprod get tobias-tpcc:1-4 info.json
./cockroach debug recover make-plan --dead-store-ids 5,6 -p y *.info.json > plan.json
roachprod put tobias-tpcc:1-4 -- plan.json
roachprod ssh tobias-tpcc:1-4 -- ./cockroach debug recover apply-plan -p y -s /mnt/data1/cockroach plan.json
roachprod start tobias-tpcc:1-4
roachprod run tobias-tpcc:6 ./cockroach workload run -- tpcc --warehouses=300 --duration 300s {pgurl:1-4}
roachprod monitor tobias-tpcc:1-4 --oneshot
roachprod run tobias-tpcc:1 -- ./cockroach node decommission --insecure 5 6

The decommissioning bit at the end is interesting. None of the replicas in the "active" cluster will consider themselves as living on n5 and n6, but decommissioning uses the (stale) meta2 copy to determine when it is done. The only reason decommissioning finishes is because all of the recovered replicas will perform up-replication (thus fixing up their meta2 record with something not containing s5 nor s6). It worked, but we may want to be deliberate about fixing up the meta2 records at some point in the future, as there may be circumstances in which the recovered range won't consider up-replicating (for example, if the constraints are changed in response to the outage and no replacement is necessary any more).

@tbg

tbg commented Dec 20, 2021

Also, it just occurred to me that the above would make for a great demo at the KV joint weekly. Would you be up for giving it early next year? If so, consider penciling it in.

Copy link
Copy Markdown
Contributor

@erikgrinaker erikgrinaker left a comment


I'm not sure how to calculate total replicas known without collecting all the data about all replicas in temporary maps. I'll drop statistics altogether except for a basic counter.

Ok, your call. I think this might be useful in reassuring operators that it's doing something expected, but I won't push for it.

Reviewed 8 of 8 files at r14, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @tbg)


pkg/kv/kvserver/loqrecovery/recovery_test.go, line 23 at r9 (raw file):

Previously, aliher1911 (Oleg) wrote…

We'll get various inconsistent states. Do we want to record that behaviour in tests as expectations? I can do that. I'm not sure what the benefit is if they have to be dropped later once we have a better solution.

Let's do it in a later PR, once the recovery logic has been revised. But yeah, we should have tests for sad paths too -- I'd argue those are just as useful/important as the happy paths, if not more so.


pkg/kv/kvserver/loqrecovery/recovery.go, line 229 at r9 (raw file):

Previously, aliher1911 (Oleg) wrote…

In #73756 it takes highest nextReplicaID out of all surviving replicas as next ID.

Couldn't that still cause issues if the dead replicas have already applied changes to the canonical descriptor that the surviving replicas haven't learned about yet? But let's discuss over on #73756.


@aliher1911 aliher1911 left a comment


tobias-tpcc: ./cockroach debug recover a... 4/4
   1: No updates planned on this node.

This is interesting. I think you have a cockroach-data directory where you ran the command; otherwise it would fail to open the store. I think the message should include the store that was read and the node id(s). Would it be an issue in prod? I guess if the client env is messy enough to have random directories lying around, that would confuse those commands.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @tbg)


pkg/kv/kvserver/loqrecovery/recovery.go, line 229 at r9 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Couldn't that still cause issues if the dead replicas have already applied changes to the canonical descriptor that the surviving replicas haven't learned about yet? But let's discuss over on #73756.

This is true! I think we might keep adding 10 even in that case. I'll update that PR.


@tbg tbg left a comment


Oh, yeah, I likely did. Should we compare the cluster ID, which is part of StoreID, and thread it through the plans, etc.? I don't think this is particularly urgent, but it might be worth filing as a follow-up issue.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @tbg)

Previously, if range quorum was lost because too many replicas became
permanently unavailable, cockroach could either be recovered from
a backup, or the debug-unsafe-remove-replicas command applied on each
node to recover to a potentially inconsistent state.
This command has some limitations in how it decides which replica is
the most up to date, and is not usually recommended.
This diff introduces a new process for the recovery. It doesn't
guarantee consistency after recovery, but it provides additional checks
and lays the foundation for further recovery work.

Release note: None
@aliher1911

bors r=erikgrinaker,tbg

@craig

craig bot commented Dec 21, 2021

Build succeeded:

@craig craig bot merged commit 54debf3 into cockroachdb:master Dec 21, 2021