kv: support multiple server-side retries#73717

Closed
nvb wants to merge 2 commits into cockroachdb:master from nvb:nvanbenschoten/serverSideUncertaintyRefreshReal

Conversation


@nvb nvb commented Dec 11, 2021

This commit lifts the restriction that a single batch does not perform
more than one server-side refresh. This restriction was added in
c6f9b92, the commit that originally introduced server-side retries. As
mentioned in [this comment](cockroachdb@c6f9b92#diff-271aa5bd8272fa39bd683ecbf27f1906bafca61dc350ff229c6627fbde697fbaR5238),
the limitation was added to avoid an infinite retry loop for non-transactional
requests (or 1PC requests) containing overlapping writes. This commit avoids
this issue in a different way — by detecting a WriteTooOld loop directly and
terminating it.

It is not yet possible to construct a batch that will perform multiple
server-side retries, but the following commit will introduce such situations.
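
The loop-termination idea described above can be sketched in a few lines of Go. This is a minimal illustration only; `WriteTooOldError` and `canRetry` here are hypothetical stand-ins, not the PR's actual types or helpers:

```go
package main

import "fmt"

// WriteTooOldError is a hypothetical stand-in for roachpb.WriteTooOldError.
type WriteTooOldError struct{ Timestamp int64 }

func (e *WriteTooOldError) Error() string {
	return fmt.Sprintf("write too old at %d", e.Timestamp)
}

// canRetry reports whether another server-side retry should be attempted.
// A non-transactional (or 1PC) batch with overlapping writes can hit
// WriteTooOld on every attempt, so retrying terminates once two consecutive
// attempts both fail that way.
func canRetry(prevErr, curErr error) bool {
	_, prevWTO := prevErr.(*WriteTooOldError)
	_, curWTO := curErr.(*WriteTooOldError)
	return !(prevWTO && curWTO)
}

func main() {
	first := &WriteTooOldError{Timestamp: 10}
	second := &WriteTooOldError{Timestamp: 15}
	fmt.Println(canRetry(nil, first))    // true: first WriteTooOld, retry once
	fmt.Println(canRetry(first, second)) // false: loop detected, terminate
}
```

The key point is that detection only needs the previous attempt's error, not a retry counter, which is why lifting the one-refresh cap is safe.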
@nvb nvb requested a review from andreimatei December 11, 2021 20:38
@nvb nvb requested a review from a team as a code owner December 11, 2021 20:38
@cockroach-teamcity

This change is Reviewable

…eshes

This addresses a TODO and makes extensions to the test easier.
nvb added a commit to nvb/cockroach that referenced this pull request Dec 13, 2021
Fixes cockroachdb#58459.
Informs cockroachdb#36431.

This commit fixes a long-standing correctness issue where non-transactional
requests did not ensure single-key linearizability even if they deferred their
timestamp allocation to the leaseholder of their (single) range. They still
don't entirely, because of cockroachdb#36431, but this change brings us one step closer to
the fix we plan to land for cockroachdb#36431 also applying to non-transactional requests.

The change addresses this by giving non-transactional requests uncertainty
intervals. This ensures that they also guarantee single-key linearizability even
with only loose (but bounded) clock synchronization. Non-transactional requests
that use a client-provided timestamp do not have uncertainty intervals and do
not make real-time ordering guarantees.

Unlike transactions, which establish an uncertainty interval on their
coordinator node during initialization, non-transactional requests receive
uncertainty intervals from their target leaseholder, using a clock reading
from the leaseholder's local HLC as the local limit and this clock reading +
the cluster's maximum clock skew as the global limit.

It is somewhat non-intuitive that non-transactional requests need uncertainty
intervals. After all, they receive their timestamp from the leaseholder of the
only range that they talk to, so isn't every value with a commit timestamp
above their read timestamp certainly concurrent? The answer is surprisingly
"no" for the following reasons, so they cannot forgo the use of uncertainty
intervals:

1. the request timestamp is allocated before consulting the replica's lease.
   This means that there are times when the replica is not the leaseholder at
   the point of timestamp allocation, and only becomes the leaseholder later.
   In such cases, the timestamp assigned to the request is not guaranteed to
   be greater than the written_timestamp of all writes served by the range at
   the time of allocation. This is true despite invariants 1 & 2 from above,
   because the replica allocating the timestamp is not yet the leaseholder.

   In cases where the replica that assigned the non-transactional request's
   timestamp takes over as the leaseholder after the timestamp allocation, we
   expect minimumLocalLimitForLeaseholder to forward the local uncertainty
   limit above TimestampFromServerClock, to the lease start time.

2. ignoring reason #1, the request timestamp assigned by the leaseholder is
   equal to its local uncertainty limit and is guaranteed to lead the
   written_timestamp of all writes served by the range at the time the
   request arrived at the leaseholder node. However, this timestamp is still
   not guaranteed to lead the commit timestamp of all of these writes. As a
   result, the non-transactional request needs an uncertainty interval with a
   global uncertainty limit far enough in advance of the local HLC clock to
   ensure that it considers any value that was part of a transaction which
   could have committed before the request was received by the leaseholder to
   be uncertain. Concretely, the non-transactional request needs to consider
   values of the following form to be uncertain:

     written_timestamp < local_limit && commit_timestamp < global_limit

   The value that the non-transactional request is observing may have been
   written on the local leaseholder at time 10, its transaction may have been
   committed remotely at time 20, acknowledged, then the non-transactional
   request may have begun and received a timestamp of 15 from the local
   leaseholder, then finally the value may have been resolved asynchronously
   and moved to timestamp 20 (written_timestamp: 10, commit_timestamp: 20).
   The failure of the non-transactional request to observe this value would
   be a stale read.
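
The predicate and the worked example above can be checked concretely. This is an illustrative sketch only: timestamps are plain integers, `isUncertain` is a hypothetical helper, and the real rule also involves the request's read timestamp and boundary conditions:

```go
package main

import "fmt"

// isUncertain applies the predicate from the text:
//
//	written_timestamp < local_limit && commit_timestamp < global_limit
//
// Values matching it must be treated as uncertain rather than skipped.
func isUncertain(writtenTS, commitTS, localLimit, globalLimit int64) bool {
	return writtenTS < localLimit && commitTS < globalLimit
}

func main() {
	// The example above: the value was written locally at 10, its
	// transaction committed remotely at 20, and the request was assigned
	// timestamp (and local limit) 15. With a maximum clock skew of 10 the
	// global limit is 25, so the value is uncertain; failing to observe it
	// would be a stale read.
	fmt.Println(isUncertain(10, 20, 15, 25)) // true
}
```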

----

Conveniently, because non-transactional requests are always scoped to a
single-range, those that hit uncertainty errors can always retry on the server,
so these errors never bubble up to the client that initiated the request.

However, before this commit, this wasn't actually possible because we did not
support server-side retries of `ReadWithinUncertaintyIntervalError`. The second
part of this commit adds support for this. This new support allows us to make
the guarantee we want about non-transactional requests never returning these
errors to the client, even though they use uncertainty intervals. It also serves
as a performance optimization for transactional requests, which now benefit from
this new capability. There's some complexity around supporting this form of
server-side retry, because it must be done above latching instead of below, but
recent refactors in cockroachdb#73557 and cockroachdb#73717 have made this possible to support
cleanly.

----

This change is related to cockroachdb#36431 in two ways. First, it allows non-transactional
requests to benefit from our incoming fix for that issue. Second, it unblocks
some of the clock refactors proposed in cockroachdb#72121 (comment),
and by extension cockroachdb#72663. Even though there are clear bugs today, I still don't
feel comfortable removing the `hlc.Clock.Update` in `Store.Send` until we make
this change. We know from cockroachdb#36431 that invariant 1 from
[`uncertainty.D6`](https://github.com/cockroachdb/cockroach/blob/22df0a6658a927e7b65dafbdfc9d790500791993/pkg/kv/kvserver/uncertainty/doc.go#L131)
doesn't hold, and yet I still do think the `hlc.Clock.Update` in `Store.Send`
masks the bugs in this area in many cases. Removing that clock update (I don't
actually plan to remove it, but I plan to disconnect it entirely from operation
timestamps) without first giving non-transactional requests uncertainty intervals
seems like it may reveal these bugs in ways we haven't seen in the past. So I'd
like to land this fix before making that change.

----

Release note (performance improvement): Certain forms of automatically retried
"read uncertainty" errors are now retried more efficiently, avoiding a network
round trip.

@andreimatei andreimatei left a comment


Can you expand the commit message to say more about why multiple refreshes remain impossible for now, and what else is about to change?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks and @nvanbenschoten)


pkg/kv/kvserver/replica_evaluate.go, line 609 at r1 (raw file):

			// that create a WriteTooOld error, then it will be rejected, fall back to
			// the non-1PC evaluation path, and succeed.
			if _, ok := prevPErr.GetDetail().(*roachpb.WriteTooOldError); ok {

mmm I find this prevPErr argument fragile. What if there are other errors in between two WriteTooOld errors? Perhaps that's not possible because of latches and whatnot, but I'm not sure we should be relying on it. Did you consider having the caller explicitly track something about the non-transactional batches?


pkg/kv/kvserver/replica_test.go, line 10009 at r1 (raw file):

	defer leaktest.AfterTest(t)()
	defer log.Scope(t).Close(t)
	// TODO(andrei): make each subtest use its own testContext so that they don't

well go all the way and make all tests use key "a"

@nvb nvb closed this Feb 10, 2022
@nvb nvb deleted the nvanbenschoten/serverSideUncertaintyRefreshReal branch February 17, 2022 02:50
