kvserver: simplify requestCanProceed and leaseGoodToGo#72978
kvserver: simplify requestCanProceed and leaseGoodToGo#72978craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
e224ebe to
914530d
Compare
914530d to
00c9da5
Compare
nvb
left a comment
There was a problem hiding this comment.
Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/kv/kvserver/replica.go, line 1349 at r1 (raw file):
// and we only hold a read-only lock. if shouldExtend { _ = r.store.stopper.RunAsyncTask(ctx, "maybe-extend-async", func(ctx context.Context) {
How do you feel about this double-async-task-to-promote-mutex approach? It kind of feels like an anti-pattern to me, but I also see why you did it given the structure of this code. Are we concerned about the potential for an unbounded number of async tasks being launched all at once when an expiration-based lease needs to be extended?
pkg/kv/kvserver/replica.go, line 1374 at r1 (raw file):
st := r.leaseStatusForRequestRLocked(ctx, now, reqTS) // Check if request is below the GC threshold and if so, error out.
It's surprising to me that we're now checking the GC threshold before checking the lease status. With "strict GC enforcement" enabled, the GC threshold is a function of the lease status, so it feels like we should be checking the GC threshold after the lease. Was there a reason to make this change?
Was it so we could short-circuit on the next two checks?
I found myself having to make a change to this method in cockroachdb#72972 and was really unsure about when this returned a nonzero lease status and whether that even mattered. Also, the code was hard to grok due to the large number of "fallthrough" if-else statements. Now it follows a standard pattern of checking and immediately returning errors as early as possible, and prefers early returns for the success cases as well. The method now clearly states what its contract (now) is: it will either return a valid lease status and no error (leaseholder read), or a zero lease status and no error (follower read or lease check skipped), or a zero lease status and an error (can't serve request, for example invalid lease, out of bounds, or pending merge, etc). Leases previously returned a half-populated lease status up for the purpose of communicating the sequence to use during proposal. However, it was cleaner to return a zero status and to repurpose the existing special casing for leases in `evalAndPropose` path to pull the sequence from the lease request. I checked the callers to make sure that none of the now absent lease statuses introduce a behavior change. The TL;DR is that callers always ignored all but the valid lease status. ---- The simplest caller is `executeAdminBatch` which doesn't even assign the returned lease status: https://github.com/cockroachdb/cockroach/blob/55720d0a4c0dd8030cd19645461226d173eefc2e/pkg/kv/kvserver/replica_send.go#L759-L775 The second caller is `executeReadOnlyBatch`: https://github.com/cockroachdb/cockroach/blob/86e2edece61091944d8575336b5de74fafc28b35/pkg/kv/kvserver/replica_read.go#L42-L46 `st` is used only if `err == nil`, in two places: https://github.com/cockroachdb/cockroach/blob/64aadfec4b3877371b6f71bf5f9aa83bd97aa515/pkg/kv/kvserver/observedts/limit.go#L42-L50 This will return early in case of a zero lease status, i.e. is a no-op in this case. The second use is storing `st` into the `endCmds`: https://github.com/cockroachdb/cockroach/blob/86e2edece61091944d8575336b5de74fafc28b35/pkg/kv/kvserver/replica_read.go#L152 which is accessed only under a similar lease validity check: https://github.com/cockroachdb/cockroach/blob/55720d0a4c0dd8030cd19645461226d173eefc2e/pkg/kv/kvserver/replica_send.go#L1019-L1021 The third caller is `executeWriteBatch`: https://github.com/cockroachdb/cockroach/blob/455cdddc6d75c03645f486b22970e5c6198a8d56/pkg/kv/kvserver/replica_write.go#L81-L89 We handled `ComputeLocalUncertaintyLimit` already. Other than that, `st` is passed to `evalAndPropose`. In that method, since it is the write path, we shouldn't see follower reads and so the remaining case are leases, where the status will be zero and the special-casing in that method takes over. ---- This commit also simplifies leaseGoodToGoRLocked. It was repeatedly reassigning `status` which I found hard to follow. Release note: None
00c9da5 to
7cbdb8c
Compare
tbg
left a comment
There was a problem hiding this comment.
TFTR, RFAL.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/kvserver/replica.go, line 1349 at r1 (raw file):
Didn't love it, thanks for the nudge - I think this is better now.
Are we concerned about the potential for an unbounded number of async tasks being launched all at once when an expiration-based lease needs to be extended?
We aren't in either version of the code, as
cockroach/pkg/kv/kvserver/replica_range_lease.go
Lines 1282 to 1295 in a5e32e1
pkg/kv/kvserver/replica.go, line 1374 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
It's surprising to me that we're now checking the GC threshold before checking the lease status. With "strict GC enforcement" enabled, the GC threshold is a function of the lease status, so it feels like we should be checking the GC threshold after the lease. Was there a reason to make this change?
Was it so we could short-circuit on the next two checks?
Yeah, with the check later we end up with a more complicated control flow again, which I was trying to avoid. I think there is also an argument that when any replica can't serve the request due to GCThreshold, that this should take precedence over the lease, as the leaseholder likely also can't serve it, so we're just wasting less time. A counter-argument could be that in principle there is no reason why the GC threshold has to be consistent across replicas, one could imagine specializing some replicas to serve historical reads and keeping everyone else's histories small.
However, I also don't like the early returns here much. The control flow I really want is one where all early returns are on error and success means reaching the end of the method. Gave this a do-over; a little bit of the fallthrough branching that I didn't like is back but I think it is manageable & I prefer it over the version you commented on here.
nvb
left a comment
There was a problem hiding this comment.
Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @tbg)
pkg/kv/kvserver/replica.go, line 1298 at r2 (raw file):
postRUnlock := func() {} r.mu.RLock() defer func() {
Sorry to ask, but this doesn't incur a heap allocation, right?
tbg
left a comment
There was a problem hiding this comment.
TFTR!
bors r=nvanbenschoten
CI is red due to what is likely an unrelated failure, which I have also seen in another PR (and notified someone about).
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @tbg)
pkg/kv/kvserver/replica.go, line 1298 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Sorry to ask, but this doesn't incur a heap allocation, right?
Good question, no it doesn't seem like it. I checked using goescape and also scrutinized the manual output of go build -gcflags='-m'. Interestingly postRUnlock does not occur in the output at all, I hope that just means it doesn't escape and doesn't mean I'm holding it wrong (plenty of other vars in replica.go are mentioned).
|
Build succeeded: |
PR cockroachdb#72978 unintentionally introduced a small change: on a follower read, we would pass the lease status to `checkTSAboveGCThreshold`. If this lease status was `VALID` but not owned by the recipient replica (i.e. this was a follower read), `checkTSAboveGCThreshold` could return the implied GCThreshold (i.e. `now()-gcttl`) instead of the lowest possible TTL. This made `TestStrictGCEnforcement` flaky since that test verifies that a historical read goes through on the recent recipient of a lease (which may not have applied the lease yet), who would serve it via a follower read and would thus error out a read that the test expected to go through. Fixes cockroachdb#73123. Release note: None
73153: kvserver: de-flake TestStrictGCEnforcement r=erikgrinaker a=tbg PR #72978 unintentionally introduced a small change: on a follower read, we would pass the lease status to `checkTSAboveGCThreshold`. If this lease status was `VALID` but not owned by the recipient replica (i.e. this was a follower read), `checkTSAboveGCThreshold` could return the implied GCThreshold (i.e. `now()-gcttl`) instead of the lowest possible TTL. This made `TestStrictGCEnforcement` flaky since that test verifies that a historical read goes through on the recent recipient of a lease (which may not have applied the lease yet), who would serve it via a follower read and would thus error out a read that the test expected to go through. Fixes #73123. Release note: None Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
I found myself having to make a change to this method in #72972 and was
really unsure about when this returned a nonzero lease status and
whether that even mattered. Also, the code was hard to grok due to
the large number of "fallthrough" if-else statements.
Now it follows a standard pattern of checking and immediately returning
errors as early as possible, and prefers early returns for the success
cases as well.
The method now clearly states what its contract (now) is: it will either
return a valid lease status and no error (leaseholder read), or a zero
lease status and no error (follower read or lease check skipped), or a
zero lease status and an error (can't serve request, for example invalid
lease, out of bounds, or pending merge, etc). Leases previously returned
a half-populated lease status up for the purpose of communicating the
sequence to use during proposal. However, it was cleaner to return a
zero status and to repurpose the existing special casing for leases
in
evalAndProposepath to pull the sequence from the lease request.I checked the callers to make sure that none of the now absent lease
statuses introduce a behavior change. The TL;DR is that callers always
ignored all but the valid lease status.
The simplest caller is
executeAdminBatchwhich doesn't even assignthe returned lease status:
cockroach/pkg/kv/kvserver/replica_send.go
Lines 759 to 775 in 55720d0
The second caller is
executeReadOnlyBatch:cockroach/pkg/kv/kvserver/replica_read.go
Lines 42 to 46 in 86e2ede
stis used only iferr == nil, in two places:cockroach/pkg/kv/kvserver/observedts/limit.go
Lines 42 to 50 in 64aadfe
This will return early in case of a zero lease status, i.e. is a no-op
in this case.
The second use is storing
stinto theendCmds:cockroach/pkg/kv/kvserver/replica_read.go
Line 152 in 86e2ede
which is accessed only under a similar lease validity check:
cockroach/pkg/kv/kvserver/replica_send.go
Lines 1019 to 1021 in 55720d0
The third caller is
executeWriteBatch:cockroach/pkg/kv/kvserver/replica_write.go
Lines 81 to 89 in 455cddd
We handled
ComputeLocalUncertaintyLimitalready. Other than that,stis passed to
evalAndPropose. In that method, since it is the writepath, we shouldn't see follower reads and so the remaining case are
leases, where the status will be zero and the special-casing in that
method takes over.
This commit also simplifies leaseGoodToGoRLocked. It was repeatedly
reassigning
statuswhich I found hard to follow.Release note: None