Skip to content

storage: properly initialize RHS replica post-split#14305

Merged
spencerkimball merged 1 commit intomasterfrom
spencerkimball/fix-post-split-lease-apply
Mar 22, 2017
Merged

storage: properly initialize RHS replica post-split#14305
spencerkimball merged 1 commit intomasterfrom
spencerkimball/fix-post-split-lease-apply

Conversation

@spencerkimball
Copy link
Copy Markdown
Member

@spencerkimball spencerkimball commented Mar 22, 2017

Ensure that the execution pathway used after a new lease is applied to a
range is invoked after splitting a range on the newly created lease for
the RHS of the split.

Fixes #13876
Closes #13875


This change is Reviewable

Ensure that the execution pathway used after a new lease is applied to a
range is invoked after splitting a range on the newly created lease for
the RHS of the split.

Fixes #13876
Closes #13875
@spencerkimball spencerkimball requested a review from bdarnell March 22, 2017 13:40
@tamird
Copy link
Copy Markdown
Contributor

tamird commented Mar 22, 2017

Reviewed 4 of 4 files at r1.
Review status: :shipit: all files reviewed at latest revision, all discussions resolved, all commit checks successful.


Comments from Reviewable

@bdarnell
Copy link
Copy Markdown
Contributor

Review status: all files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


pkg/storage/store.go, line 1914 at r1 (raw file):

	// the replica according to whether it holds the lease. This enables
	// the PushTxnQueue. Note that we pass in an empty lease for prevLease.
	rightRng.leasePostApply(ctx, rightLease, rightReplicaID, &roachpb.Lease{})

leasePostApply does a lot of stuff, and this isn't the right place for all of it. In particular it clears the timestamp cache (which we carefully copied over from the LHS above). For today, let's do a quick fix and just do if (iAmTheLeaseHolder) { r.pushTxnQueue.Enable() }. After cutting this week's beta we can figure out how to integrate this more effectively (It may work to do leasePostApply earlier, before the timestamp cache copy, but this needs more analysis).


Comments from Reviewable

@spencerkimball
Copy link
Copy Markdown
Member Author

Review status: all files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


pkg/storage/store.go, line 1914 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

leasePostApply does a lot of stuff, and this isn't the right place for all of it. In particular it clears the timestamp cache (which we carefully copied over from the LHS above). For today, let's do a quick fix and just do if (iAmTheLeaseHolder) { r.pushTxnQueue.Enable() }. After cutting this week's beta we can figure out how to integrate this more effectively (It may work to do leasePostApply earlier, before the timestamp cache copy, but this needs more analysis).

I really want to use the same execution pathway to init the lease-related aspects of the replica post-split.

I think this is still the right place for these changes. For example, we only clear the timestamp cache if we're not the replica that owns the lease, so we actually will not be clearing it on the leaseholding replica after copying from the LHS. For the replicas which don't own the lease, we would expect the copying that was done in splitPostApply to have copied an empty timestamp cache, so clearing is appropriate and a noop in any case.

The leasePostApply method actually does other useful stuff already for us, such as potentially transferring the raft leadership to match leaseholder for the RHS (there might have been one underway on the LHS already).


Comments from Reviewable

@bdarnell
Copy link
Copy Markdown
Contributor

Review status: all files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


pkg/storage/store.go, line 1914 at r1 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

I really want to use the same execution pathway to init the lease-related aspects of the replica post-split.

I think this is still the right place for these changes. For example, we only clear the timestamp cache if we're not the replica that owns the lease, so we actually will not be clearing it on the leaseholding replica after copying from the LHS. For the replicas which don't own the lease, we would expect the copying that was done in splitPostApply to have copied an empty timestamp cache, so clearing is appropriate and a noop in any case.

The leasePostApply method actually does other useful stuff already for us, such as potentially transferring the raft leadership to match leaseholder for the RHS (there might have been one underway on the LHS already).

It clears the timestamp cache if we don't have the lease, but it sets the low-water mark if we do. When I wrote the original comment I was afraid that this would advance the low-water timestamp, but now I think it's safe because it uses the backdated timestamp of the new lease. But my point is that there are subtle interactions there and I'd like more time to review everything before we start calling leasePostApply.

The leadership transfer won't do anything because the new range doesn't yet have a leader, and this needs to be considered in conjunction with the solution to #9727.


Comments from Reviewable

@spencerkimball
Copy link
Copy Markdown
Member Author

Does #9727 still require a fix? Pete's contention is nothing else needs to be done for that issue.

@bdarnell
Copy link
Copy Markdown
Contributor

He said that, but reopened it after he closed it. I don't know whether there's anything left to do or not. (after reviewing that and #9550, I think it may be a test-only issue)

@bdarnell
Copy link
Copy Markdown
Contributor

Anyway, we've built the beta candidate now, so it's getting a little late for the quick fix. I'll take another look at this PR and see if it's safe to use leasePostApply.

@spencerkimball
Copy link
Copy Markdown
Member Author

Review status: all files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


pkg/storage/store.go, line 1914 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

It clears the timestamp cache if we don't have the lease, but it sets the low-water mark if we do. When I wrote the original comment I was afraid that this would advance the low-water timestamp, but now I think it's safe because it uses the backdated timestamp of the new lease. But my point is that there are subtle interactions there and I'd like more time to review everything before we start calling leasePostApply.

The leadership transfer won't do anything because the new range doesn't yet have a leader, and this needs to be considered in conjunction with the solution to #9727.

By invoking this execution pathway, I'm trying to avoid creating more work down the road in the best case and new bugs in the worst case.

I understand the hesitancy to sign off without more time spent reviewing leasePostApply. I have reviewed it in detail, however and will list my findings here for posterity:

There are two bools which control the logic: iAmTheLeaseHolder and leaseChangingHands. leaseChangingHands is always true for this invocation from splitPostApply. iAmTheLeaseHolder is true for one of N replicas.

leasePostApply has the following effects (parentheticals indicate which replica of the RHS post-split is effected):

  • sets tsCache low water mark (applies to lease holder)
  • clears the replica's request counts (applies to lease holder, but is a noop, because stats are empty)
  • gossips first range (doesn't apply; RHS will never be first range)
  • clears tsCache (applies to all non-leaseholders)
  • disables the push txn queue (applies to all non-leaseholders)
  • calls maybeTransferRaftLeadership (applies to non-leaseholders, but is a noop as there will be no Raft leader)
  • gossips store capacity (doesn't apply as previous lease is empty)
  • gossips system config and node liveness (applies to lease holder – this is all idempotent work)
  • enables the push txn queue (applies to leaseholder)

Comments from Reviewable

@bdarnell
Copy link
Copy Markdown
Contributor

:lgtm:


Review status: all files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


Comments from Reviewable

@spencerkimball spencerkimball merged commit 440ab56 into master Mar 22, 2017
@spencerkimball spencerkimball deleted the spencerkimball/fix-post-split-lease-apply branch March 22, 2017 20:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants