Skip to content

kvserver: replica might crash evaluating a condition as leaseholder, …#78438

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
shralex:stabilize_lease_transfer1
Mar 24, 2022
Merged

kvserver: replica might crash evaluating a condition as leaseholder, …#78438
craig[bot] merged 1 commit intocockroachdb:masterfrom
shralex:stabilize_lease_transfer1

Conversation

@shralex
Copy link
Copy Markdown
Contributor

@shralex shralex commented Mar 24, 2022

…when it isn't

In #77841, we only remove the
leaseholder if there is a VOTER_INCOMING replica in the same JOINT config.
But it is also possible that a node A was the lesaeholder, and then the lease
gets transferred to node B, who tries to remove A. Since B isn't getting removed,
it is fine to enter a JOINT config without a VOTER_INCOMING node. But A
might still believe its the leaseholder, and this might cause it to
crash when it finds out that its in a JOINT config and there is no INCOMING node.
In this case, we can just continue, since an error will be thrown in replica_raft.go

This resolves #77937

Release note: None

Release justification: fixes flakiness caused by #77841

@shralex shralex requested a review from a team as a code owner March 24, 2022 17:19
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@aayushshah15
Copy link
Copy Markdown
Contributor

aayushshah15 commented Mar 24, 2022

Something that's unclear to me at the moment is why the restore2TB roachtest would ever downreplicate a range to begin with. If I understand correctly, we should only be hitting this Fatal if we're downreplicating a range. @adityamaru, do you know?

nvm, I see that the restore roachtest isn't actually hitting the Fatal added in #77841

Copy link
Copy Markdown
Contributor

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @shralex)


pkg/kv/kvserver/replica_command.go, line 1213 at r1 (raw file):

	if voterIncomingTarget == (roachpb.ReplicaDescriptor{}) {
		// Couldn't find a VOTER_INCOMING target. This should not happen since we only go into the

We'll want to update this comment with a discussion on how we can reach this point and what we expect to happen after we return nil in the case where we aren't actually the leaseholder and in the case where we are (mistakenly).


pkg/kv/kvserver/replica_command.go, line 1216 at r1 (raw file):

		// JOINT config if the leaseholder is being removed, when there is a VOTER_INCOMING replica.
		// Killing the leaseholder to force lease transfer.
		log.Warningf(ctx, "no VOTER_INCOMING to transfer lease to. "+

Let's actually make this an Infof since it's not unexpected to end up here. And can we update the message to describe the situation more precisely? A rule of thumb for log messages is to ask the question: will logging this lead to L2 pages from users of Cockroach who closely monitor their logs for "concerning" looking logs?

…when it isn't

In cockroachdb#77841, we only remove the
leaseholder if there is a VOTER_INCOMING replica in the same JOINT config.
But it is also possible that a node A was the lesaeholder, and then the lease
gets transferred to node B, who tries to remove A. Since B isn't getting removed,
it is fine to enter a JOINT config without a VOTER_INCOMING node. But A
might still believe its the leaseholder, and this might cause it to
crash when it finds out that its in a JOINT config and there is no INCOMING node.
In this case, we can just continue, since an error will be thrown in replica_raft.go

Release note: None

Release justification: fixes flakiness caused by cockroachdb#77841
@shralex shralex force-pushed the stabilize_lease_transfer1 branch from 2a04669 to 7cb3bf0 Compare March 24, 2022 21:16
Copy link
Copy Markdown
Contributor

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @shralex)

@shralex
Copy link
Copy Markdown
Contributor Author

shralex commented Mar 24, 2022

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Mar 24, 2022

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sql/importer: TestMultiNodeExportStmt failed

4 participants