kv: move range lease checks and transfers below latching #59086
craig[bot] merged 3 commits into cockroachdb:master
Conversation
1affffb to 35a8b4b
tbg
left a comment
Need to finish this review, but flushing what I have since it won't be today
Reviewed 47 of 63 files at r2, 12 of 14 files at r3, 3 of 5 files at r4, 4 of 4 files at r5, 4 of 4 files at r7, 28 of 28 files at r8, 1 of 1 files at r9, 17 of 17 files at r10.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @andreimatei, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/replica.go, line 1143 at r11 (raw file):
rSpan, err := keys.Range(ba.Requests)
if err != nil {
	return st, err
Is this intentional throughout this method? I would prefer if we returned the explicit zero value on errors and only returned anything else for a good reason and with lots of comments.
pkg/kv/kvserver/replica.go, line 1156 at r11 (raw file):
// Is the request fully contained in the range?
// NB: we only need to check that the request is in the Range's key bounds
// at proposal time, not at application time, because the spanlatch manager
Perhaps add that we're holding latches at proposal time.
pkg/kv/kvserver/replica.go, line 1165 at r11 (raw file):
// Is the lease valid?
if !ba.ReadConsistency.RequiresReadLease() {
	st.Now = now
The way we start out with an empty lease status and then stuff things into it makes me nervous. Let's explicitly initialize it fully at each site.
I'm particularly confused by this line here, though. We just don't put the lease into the lease status?
pkg/kv/kvserver/replica_follower_read.go, line 63 at r11 (raw file):
repDesc, err := r.getReplicaDescriptorRLocked()
if err != nil {
	log.Warningf(ctx, "%s", err)
?
pkg/kv/kvserver/replica_range_lease.go, line 786 at r11 (raw file):
// how a Subsume request sets the FreezeStart time and pauses closed
// timestamps during evaluation and communicates this back up using the
// LocalResult.
Seems better. The AdminX family of methods should access the internals as little as possible, as they are mostly just ways to run arbitrary code in the right places, and ideally code that operates on a sane abstraction of Replica. Won't quite happen, but every step counts.
pkg/kv/kvserver/replica_range_lease.go, line 934 at r11 (raw file):
// the specified timestamp. The method will return the lease status,
// along with an error indicating whether any of these conditions is
// unsatisfied.
Explicitly say that the lease status is either zero or fully populated.
pkg/kv/kvserver/replica_range_lease.go, line 1021 at r11 (raw file):
// leaseGoodToGo is like leaseGoodToGoRLocked, but does not require the
// replica read lock to be held.
Heh, I'd have said "but will acquire the replica read lock". As is, it sounds like this doesn't need it for some reason.
pkg/kv/kvserver/replica_range_lease.go, line 1114 at r11 (raw file):
// renewed the lease, so we return the handle to block on renewal.
// Otherwise, we don't need to wait for the extension and simply
// ignore the returned handle (whose channel is buffered) and continue.
Stale comment about async renewal.
pkg/kv/kvserver/replica_read.go, line 48 at r11 (raw file):
// Limit the transaction's maximum timestamp using observed timestamps.
// TODO(nvanbenschoten): now that we've pushed this down here, consider
// keeping the "local max timestamp" on the stack and never modifying the
🤝
pkg/kv/kvserver/batcheval/cmd_push_txn.go, line 127 at r11 (raw file):
}
now := cArgs.EvalCtx.Clock().Now()
// TODO(nvanbenschoten): remove this limitation. But when doing so,
What is the limitation being removed?
pkg/testutils/testcluster/testcluster.go, line 785 at r10 (raw file):
// We are going to stop the current leaseholder's liveness, wait for it to
// expire, and then issue a request to the target in hopes that it grabs the
#59059 once it's in should allow you to make this more efficient.
35a8b4b to 0f17064
andreimatei
left a comment
Reviewed 10 of 30 files at r21.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/replica.go, line 1143 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Is this intentional throughout this method? I would prefer if we returned the explicit zero value on errors and only returned anything else for a good reason and with lots of comments.
+1. If this is intended, put a method comment about it
pkg/kv/kvserver/replica.go, line 1136 at r21 (raw file):
defer func() {
	if shouldExtend {
		r.maybeExtendLeaseAsync(ctx, st)
Consider lifting this defer into a wrapper function.
Also, it doesn't seem very nice to me that this shouldExtend is modified elsewhere (in fact plumbed over from God knows where) and checked in the defer. Are the respective checks expensive? Can't we do it only inside maybeExtendLeaseAsync()? We could capture a now if you're worried about the clock reading.
pkg/kv/kvserver/replica.go, line 1156 at r21 (raw file):
// Is the request fully contained in the range?
// NB: we only need to check that the request is in the Range's key bounds
// at proposal time, not at application time, because the spanlatch manager
s/proposal time/evaluation time ?
pkg/kv/kvserver/replica_range_lease.go, line 1285 at r21 (raw file):
// maybeExtendLeaseAsync attempts to extend the expiration-based lease // asynchronously, if doing so is deemed beneficial by an earlier call
this talks about "an earlier call to shouldExtend...", but this method checks that again. Why?
pkg/kv/kvserver/replica_range_lease.go, line 1293 at r21 (raw file):
	return
}
if log.V(2) {
log.ExpensiveLogEnabled(ctx, 2)
pkg/kv/kvserver/replica_send.go, line 443 at r21 (raw file):
}
// g's latches will be dropped, but it retains its spot in lock wait-queues.
return r.concMgr.HandleWriterIntentError(ctx, g, t.LeaseSequence, t)
was it particularly necessary to plumb the lease sequence through WriteIntentError? We still have the latches here (right?), so can't we just read the lease from the replica? If we can, I think that'd be better - cause these errors unfortunately are part of the "KV API" (or is WriteIntentError terminated on the server?).
pkg/kv/kvserver/replica_send.go, line 514 at r21 (raw file):
defer r.mu.RUnlock()
if r.canServeFollowerReadRLocked(ctx, ba, pErr.GoError()) {
	// Follower read possible. Retry command.
s/command/request
pkg/kv/kvserver/batcheval/cmd_lease_transfer.go, line 30 at r20 (raw file):
	rs ImmutableRangeState, _ roachpb.Header, _ roachpb.Request, latchSpans, _ *spanset.SpanSet,
) {
	// TransferLease must not run concurrently with any other command so it uses
s/command/request
pkg/kv/kvserver/batcheval/cmd_lease_transfer.go, line 38 at r20 (raw file):
// in the range, even though the only key the TransferLease actually writes
// to is the RangeLeaseKey. This guarantees that it conflicts with any other
// command because every command must declare at least one addressable key.
s/command/request
pkg/kv/kvserver/batcheval/cmd_lease_transfer.go, line 45 at r20 (raw file):
// would only be safe if we also accounted for clock uncertainty in all read
// latches so that any read that may need to observe state on the new
// leaseholder gets blocked. We actually already do this for transactional
well this is how I assumed the latching would work when I read it. I think it's kind of a biggie; these transfers are already disruptive enough. Could we really not do this? Perhaps there's an easy hack we could do to have non-transaction requests conflict with the latches taken here, but not transactional requests?
pkg/kv/kvserver/batcheval/cmd_lease_transfer.go, line 70 at r21 (raw file):
prevLease, _ := cArgs.EvalCtx.GetLease()
// Forward the lease's start time to a current clock reading. Now
I think the "now" here will confuse people, since it doesn't refer to something done in this function. How about s/Now that/At this point
pkg/kv/kvserver/batcheval/cmd_lease_transfer.go, line 75 at r21 (raw file):
// serviced by the leaseholder before it stopped serving requests
// (i.e. before the TransferLease request acquired latches).
newLease := args.Lease
consider poisoning args.Lease by setting it to nil or smth
pkg/kv/kvserver/batcheval/declare.go, line 88 at r20 (raw file):
// guarantees that the caller conflicts with any other command because every
// command must declare at least one addressable key, which is tested against
// in TestRequestsSerializeWithSubsume.
TestRequestsSerializeWithAllKeys now
pkg/kv/kvserver/batcheval/declare.go, line 91 at r20 (raw file):
func declareAllKeys(latchSpans *spanset.SpanSet) {
	// NOTE: we don't actually know what the end key of the Range will
	// be at the time of command evaluation (see ImmutableRangeState),
s/command/request
pkg/kv/kvserver/batcheval/declare.go, line 93 at r20 (raw file):
// be at the time of command evaluation (see ImmutableRangeState),
// so we simply declare a latch over the entire keyspace. This may
// extend beyond the Range, but this is ok for the purpose of
please make sure that this fact - that the key span returned by these declare functions can extend to MaxKey - is reflected in some comment on this declare*Keys interface
pkg/kv/kvserver/concurrency/concurrency_manager.go, line 448 at r21 (raw file):
// request is supposed to hold latches while evaluating in the first place.
func (g *Guard) AssertLatches() {
	if shouldAcquireLatches(g.Req) && !g.HoldingLatches() {
how come this wasn't needed before?
pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 563 at r12 (raw file):
// restart for uncertainty if the push succeeds and we come back and
// read.
h.Timestamp.Forward(req.Txn.MaxTimestamp)
hmm so how did this work, given that MaxTimestamp is in the future? What happened to such a push?
pkg/roachpb/errors.proto, line 527 at r21 (raw file):
RangeFeedRetryError rangefeed_retry = 38;
IndeterminateCommitError indeterminate_commit = 39;
InvalidLeaseError invalid_lease_error = 40;
make a note that this error doesn't go to clients. Maybe group it with the others above that are like this.
Btw you really had to make this a pErr, right?
pkg/testutils/testcluster/testcluster.go, line 785 at r10 (raw file):
Previously, tbg (Tobias Grieger) wrote…
#59059 once it's in should allow you to make this more efficient.
That PR is now in. If we can avoid PauseAllHeartbeatsForTest, that'd be awesome.
pkg/util/hlc/timestamp.go, line 375 at r20 (raw file):
// LessEq returns whether the receiver is less than or equal to the parameter.
func (t ClockTimestamp) LessEq(s ClockTimestamp) bool { return Timestamp(t).LessEq(Timestamp(s)) }
why did you delete this?
0f17064 to 98d35ed
nvb
left a comment
TFTRs!
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @andreimatei, @lunevalex, and @tbg)
pkg/kv/kvserver/replica.go, line 1143 at r11 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
+1. If this is intended, put a method comment about it
Done.
pkg/kv/kvserver/replica.go, line 1165 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
The way we start out with an empty lease status and then stuff things into it makes me nervous. Let's explicitly initialize it fully at each site.
I'm particularly confused by this line here, though. We just don't put the lease into the lease status?
Done.
pkg/kv/kvserver/replica.go, line 1136 at r21 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Consider lifting this defer into a wraper function.
Also, it doesn't seem very nice to me that this shouldExtend is modified elsewhere (in fact plumbed over from God knows where) and checked in the defer. Are the respective checks expensive? Can't we do it only inside maybeExtendLeaseAsync()? We could capture a now if you're worried about the clock reading.
The issue is the locking. We first detect that we need to extend when holding the RLock but need to hold the exclusive Lock (and check again) in maybeExtendLeaseAsync. See the method's comment.
pkg/kv/kvserver/replica.go, line 1156 at r21 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
s/proposal time/evaluation time ?
Done.
pkg/kv/kvserver/replica_follower_read.go, line 63 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
?
Removed.
pkg/kv/kvserver/replica_range_lease.go, line 786 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Seems better. The AdminX family of methods should access the internals as little as possible, as they are mostly just ways to run arbitrary code in the right places, and ideally code that operates on a sane abstraction of Replica. Won't quite happen, but every step counts.
Yeah, I think this will be nice. I'm going to do it in a follow-up PR though.
pkg/kv/kvserver/replica_range_lease.go, line 934 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Explicitly say that the lease status is either zero or fully populated.
Done.
pkg/kv/kvserver/replica_range_lease.go, line 1021 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Heh, I'd have said "but will acquire the replica read lock". As is, it sounds like this doesn't need it for some reason.
Done.
pkg/kv/kvserver/replica_range_lease.go, line 1114 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Stale comment about async renewal.
Done.
pkg/kv/kvserver/replica_range_lease.go, line 1285 at r21 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
this talks about "an earlier call to shouldExtend...", but this method checks that again. Why?
Because we dropped the lock and re-acquired it. I added a comment.
pkg/kv/kvserver/replica_range_lease.go, line 1293 at r21 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
log.ExpensiveLogEnabled(ctx, 2)
Done.
pkg/kv/kvserver/replica_read.go, line 48 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
🤝
It came out nicely! #57077
pkg/kv/kvserver/replica_send.go, line 443 at r21 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
was it particularly necessary to plumb the lease sequence through WriteIntentError? We still have the latches here (right?), so can't we just read the lease from the replica? If we can, I think that'd be better - cause these errors unfortunately are part of the "KV API" (or is WriteIntentError terminated on the server?).
Discussed in person.
pkg/kv/kvserver/batcheval/cmd_lease_transfer.go, line 30 at r20 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
s/command/request
Done.
pkg/kv/kvserver/batcheval/cmd_lease_transfer.go, line 38 at r20 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
s/command/request
Done.
pkg/kv/kvserver/batcheval/cmd_lease_transfer.go, line 45 at r20 (raw file):
I think it's kind of a biggie
This is nothing new (see here), so I don't want to open this can of worms right now. There's enough in-flux.
pkg/kv/kvserver/batcheval/cmd_lease_transfer.go, line 70 at r21 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I think the "now" here will confuse people, since it doesn't refer to something done in this function. How about s/Now that/At this point
Good point. Done.
pkg/kv/kvserver/batcheval/cmd_lease_transfer.go, line 75 at r21 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
consider poisoning args.Lease by setting it to nil or smth
Done.
pkg/kv/kvserver/batcheval/cmd_push_txn.go, line 127 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
What is the limitation being removed?
That we can push a transaction above the current clock time. This is allowed now that the timestamp cache can lead the current clock.
pkg/kv/kvserver/batcheval/declare.go, line 88 at r20 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
TestRequestsSerializeWithAllKeys now
Done.
pkg/kv/kvserver/batcheval/declare.go, line 91 at r20 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
s/command/request
Done.
pkg/kv/kvserver/concurrency/concurrency_manager.go, line 448 at r21 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
how come this wasn't needed before?
We should have always had it. I think there are a few operations that can now hit an InvalidLeaseError that couldn't previously hit any concurrency retry errors, so they never hit this path. But maybe they could have hit a MergeInProgressError? Maybe I should backport this just to be safe.
pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 563 at r12 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
hmm so how did this work, given that MaxTimestamp is in the future? What happened to such a push?
MaxTimestamp used to be limited to the observed timestamp before this point. Now it's after. So we can't just use it. But this approach is safer anyway, for the reasons discussed in this comment.
pkg/roachpb/errors.proto, line 527 at r21 (raw file):
make a note that this error doesn't go to clients. Maybe group it with the others above that are like this.
Included on the type itself.
Btw you really had to make this a pErr, right?
Unfortunately, yes.
pkg/testutils/testcluster/testcluster.go, line 785 at r10 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
That PR is now in. If we can avoid
PauseAllHeartbeatsForTest, that'd be awesome.
Done. Thanks @lunevalex!
pkg/util/hlc/timestamp.go, line 375 at r20 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
why did you delete this?
It became unused so the linter started yelling.
598c204 to 686c73b
sumeerbhola
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @andreimatei, @lunevalex, and @tbg)
pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 563 at r12 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
MaxTimestamp used to be limited to the observed timestamp before this point. Now it's after. So we can't just use it. But this approach is safer anyway, for the reasons discussed in this comment.
(drive-by comment)
I'm quite confused by this -- should I be re-reading the synthetic timestamp RFC to remind me what is going on here wrt not using observed timestamps?
Specifically, I'm confused by the comment below that observed timestamps are not applicable across lease changes. I am assuming that was true before too, but here we are evaluating on this node as the leaseholder, since we are on this node's concurrency manager. If the lease moved we would need to evaluate using the other node's concurrency manager, and would use the observed timestamp for the other node.
686c73b to 9170b2e
nvb
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @andreimatei, @lunevalex, and @tbg)
pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 563 at r12 (raw file):
I'm confused by the comment below that observed timestamps are not applicable across lease changes. I am assuming that was true before too
Yes, that was true before. See this comment for an explanation.
but here we are evaluating on this node as the leaseholder, since we are on this node's concurrency manager
That is one of the things this PR is changing. Since we now check the lease (and only compute the local uncertainty limit) below latches and the concurrency manager, this may not always be true. However, in practice, we clear the lock table on non-leaseholder replicas, and so we typically shouldn't end up here unless we're on a leaseholder replica.
So the comment is trying to explain why we can't use an observed timestamp directly, because it may not be the same as the local uncertainty limit. If we had the local uncertainty limit at this point, we could use it, but we don't and so this approach is a straightforward approximation of it that doesn't lose much.
I should update the comment, now that the terminology around "local uncertainty limits" has improved. Other than that, does that explanation make sense?
Friendly ping on this.
sumeerbhola
left a comment
Reviewed 13 of 69 files at r22, 5 of 20 files at r23, 2 of 33 files at r24.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @andreimatei, @lunevalex, @sumeerbhola, and @tbg)
pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 563 at r12 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I'm confused by the comment below that observed timestamps are not applicable across lease changes. I am assuming that was true before too
Yes, that was true before. See this comment for an explanation.
but here we are evaluating on this node as the leaseholder, since we are on this node's concurrency manager
That is one of the things this PR is changing. Since we now check the lease (and only compute the local uncertainty limit) below latches and the concurrency manager, this may not always be true. However, in practice, we clear the lock table on non-leaseholder replicas, and so we typically shouldn't end up here unless we're on a leaseholder replica.
So the comment is trying to explain why we can't use an observed timestamp directly, because it may not be the same as the local uncertainty limit. If we had the local uncertainty limit at this point, we could use it, but we don't and so this approach is a straightforward approximation of it that doesn't lose much.
I should update the comment, now that the terminology around "local uncertainty limits" has improved. Other than that, does that explanation make sense?
Does the local uncertainty limit account for the incoming lease, as discussed in the comment you linked to, or not?
Either way, I think what you are saying is that we don't hold latches here, so we can't use the local uncertainty limit. Observing the local clock is a safe (though pessimistic) alternative, since we are not utilizing what had been observed in previous request evaluations for this txn on this node, and it also takes into account the lease start time. Correct?
Updating the comment sounds good.
andreimatei
left a comment
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @adityamaru, @andreimatei, @lunevalex, @nvanbenschoten, @sumeerbhola, and @tbg)
pkg/kv/kvserver/replica.go, line 1136 at r21 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
The issue is the locking. We first detect that we need to extend when holding the RLock but need to hold the exclusive Lock (and check again) in
maybeExtendLeaseAsync. See the method's comment.
ok
pkg/kv/kvserver/replica_range_lease.go, line 1299 at r24 (raw file):
r.mu.Lock()
defer r.mu.Unlock()
// Check shouldExtendLeaseRLocked again, because we dropped the lock.
nit: "we dropped the lock" is not very clear. I'd say that perhaps others raced since we checked.
pkg/util/hlc/timestamp.go, line 375 at r20 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
It became unused so the linter started yelling.
consider leaving it and tricking the linter with a _ = LessEq. Weak people like myself are tempted to take shortcuts when they need this function and it doesn't exist.
Remove mention of LeaseStatus and rewrap.
9170b2e to c44b357
nvb
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @adityamaru, @andreimatei, @lunevalex, @sumeerbhola, and @tbg)
pkg/kv/kvserver/replica_range_lease.go, line 1299 at r24 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
nit: "we dropped the lock" is not very clear. I'd say that perhaps others raced since we checked.
Done.
pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 563 at r12 (raw file):
Yes, what you said is all correct.
Updating the comment sounds good.
Done.
pkg/util/hlc/timestamp.go, line 375 at r20 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
consider leaving it and tricking the linter with a
_ = LessEq. Weak people like myself are tempted to take shortcuts when they need this function and it doesn't exist.
Done.
bors r+
Build succeeded:
Needed for cockroachdb#61137.

This commit updates the manner through which lease transfers (through `LeaseTransferRequest`) and range merges (through `SubsumeRequest`) handle the "transfer of power" from their outgoing leaseholder to their incoming leaseholder. Specifically, it updates the handling of these requests in order to rationalize the interaction between their evaluation and the closing of timestamps through the closed timestamp side-transport. It is now clearer when and how these requests instruct the outgoing leaseholder to relinquish its ability to advance the closed timestamp and, as a result, now possible for the requests to query and operate on the maximum closed timestamp published by the outgoing leaseholder.

For lease transfers, this commit begins by addressing an existing TODO to push the revocation of the outgoing lease out of `AdminTransferLease` and into the evaluation of `LeaseTransferRequest` through a new `RevokeLease` method on the `EvalContext`. Once a lease is revoked, the side-transport will no longer be able to advance the closed timestamp under it. This was made possible by cockroachdb#59086 and was suggested by @tbg during the code review. We generally like to keep replica state changes out of "admin" requests themselves, which are intended to coordinate changes through individual non-admin requests. Admin requests generally don't even need to evaluate on a leaseholder (though they try to), so having them perform any state changes is fragile.

For range merges, this commit removes the `MaybeWatchForMerge` flag from the `LocalResult` returned by `SubsumeRequest` and replaces it with a `WatchForMerge` method on the `EvalContext`. This allows the `SubsumeRequest` to launch the range merge watcher goroutine during its evaluation, which the side-transport checks for to see whether a leaseholder can advance its closed timestamp. In doing so, the `SubsumeRequest` is able to pause closed timestamps when it wants and is also able to observe and return the maximum closed timestamp _after_ the closed timestamp has stopped advancing. This is a stark improvement over the approach with the original closed timestamp system, which required a herculean effort in cockroachdb#50265 to make correct.

With these refactors complete, the closed timestamp side-transport should have a much easier and safer time checking whether a given leaseholder is able to advance its closed timestamp.

Release justification: Necessary for the safety of new functionality.
61221: kv: sync lease transfers and range merges with closed timestamp side-transport r=nvanbenschoten a=nvanbenschoten

Needed for the safety of #61137.

This commit updates the manner through which lease transfers (through `LeaseTransferRequest`) and range merges (through `SubsumeRequest`) handle the "transfer of power" from their outgoing leaseholder to their incoming leaseholder. Specifically, it updates the handling of these requests in order to rationalize the interaction between their evaluation and the closing of timestamps through the closed timestamp side-transport. It is now clearer when and how these requests instruct the outgoing leaseholder to relinquish its ability to advance the closed timestamp and, as a result, now possible for the requests to query and operate on the maximum closed timestamp published by the outgoing leaseholder.

For lease transfers, this commit begins by addressing an existing TODO to push the revocation of the outgoing lease out of `AdminTransferLease` and into the evaluation of `LeaseTransferRequest` through a new `RevokeLease` method on the `EvalContext`. Once a lease is revoked, the side-transport will no longer be able to advance the closed timestamp under it. This was made possible by #59086 and was suggested by @tbg during the code review. We generally like to keep replica state changes out of "admin" requests themselves, which are intended to coordinate changes through individual non-admin requests. Admin requests generally don't even need to evaluate on a leaseholder (though they try to), so having them perform any state changes is fragile.

For range merges, this commit removes the `MaybeWatchForMerge` flag from the `LocalResult` returned by `SubsumeRequest` and replaces it with a `WatchForMerge` method on the `EvalContext`. This allows the `SubsumeRequest` to launch the range merge watcher goroutine during its evaluation, which the side-transport checks for to see whether a leaseholder can advance its closed timestamp. In doing so, the `SubsumeRequest` is able to pause closed timestamps when it wants and is also able to observe and return the maximum closed timestamp _after_ the closed timestamp has stopped advancing. This is a stark improvement over the approach with the original closed timestamp system, which required a herculean effort in #50265 to make correct.

With these refactors complete, the closed timestamp side-transport should have a much easier and safer time checking whether a given leaseholder is able to advance its closed timestamp.

Release justification: Necessary for the safety of new functionality.

61237: util/log: ensure that all channel logs are displayed with `-show-logs` r=tbg a=knz

When `-show-logs` is specified, the `log.Scope` becomes a no-op and the default configuration in the `log` package is used. This is the only time ever when the default configuration is used. Prior to this patch, only the logging for the DEV channel would make its way to the standard error (and the test output) in that case. This was unfortunate, since the intent (as spelled out in a comment already) was to display everything. This patch fixes that.

Release justification: non-production code changes
Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
This commit reworks the timestamp that pushers use when pushing intents to synthetic timestamps. Usually, as of cockroachdb#59086, a pusher limits the timestamp that it pushes conflicting intents to using the local HLC clock. This is based on the assumption that it will be able to ignore any intent above the local HLC clock's reading using observed timestamps and a local uncertainty limit.

However, this argument only holds if we expect to be able to use a local uncertainty limit when we return to read the pushed intent. Notably, local uncertainty limits can not be used to ignore intents with synthetic timestamps that would otherwise be in a reader's uncertainty interval. This is because observed timestamps do not apply to intents/values with synthetic timestamps. So if we know that we will be pushing an intent to a synthetic timestamp, we don't limit the value to a clock reading from the local clock.

This came out of some exploration of kvnemesis. I don't think this was actually causing an issue, but it seemed like a potential source of an infinite push + conflict loop.

Release note (bug fix): Read-write contention on GLOBAL tables no longer has a potential to thrash without making progress.
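The push rule described above can be sketched as a toy model. This is not CockroachDB code: the `Timestamp` struct and `pushTo` helper are illustrative names, and the model reduces timestamps to plain integers with a synthetic flag.

```go
package main

import "fmt"

// Timestamp is a toy model of an HLC timestamp that may be synthetic
// (i.e. potentially ahead of every physical clock).
type Timestamp struct {
	WallTime  int64
	Synthetic bool
}

// pushTo returns the timestamp a pusher would move a conflicting intent to.
// pusherLimit models the pusher's global uncertainty limit (its read
// timestamp plus the maximum clock offset); localClock is a reading from
// the pusher's local HLC clock.
func pushTo(pusherLimit, localClock Timestamp) Timestamp {
	if pusherLimit.Synthetic {
		// Observed timestamps cannot trim uncertainty intervals for
		// synthetic timestamps, so the intent must clear the full limit.
		return pusherLimit
	}
	// A later read on this node can use an observed timestamp (at most the
	// local clock reading) to ignore anything above the clock, so pushing
	// to the clock reading is sufficient.
	if localClock.WallTime < pusherLimit.WallTime {
		return localClock
	}
	return pusherLimit
}

func main() {
	clock := Timestamp{WallTime: 100}
	fmt.Println(pushTo(Timestamp{WallTime: 200}, clock).WallTime)                  // 100: capped at the clock
	fmt.Println(pushTo(Timestamp{WallTime: 200, Synthetic: true}, clock).WallTime) // 200: no clock cap
}
```

The second case is the fix: without it, a pusher could push a synthetic-timestamp intent to a timestamp that still falls in its uncertainty interval, setting up the push + conflict loop mentioned above.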
59863: cli/zip: issue per-node requests concurrently r=tbg a=knz

Requested by @ajwerner and @andreimatei. First commit from #64081.

Release note (cli change): `cockroach debug zip` now attempts to pull data from multiple nodes concurrently, up to 15 nodes at a time. This change is meant to accelerate the data collection when a cluster contains multiple nodes. This behavior can be negated with the new command-line flag `--sequential`.

63800: kv: properly handle intent pushes to synthetic timestamps r=nvanbenschoten a=nvanbenschoten

This commit reworks the timestamp that pushers use when pushing intents to synthetic timestamps. Usually, as of #59086, a pusher limits the timestamp that it pushes conflicting intents to using the local HLC clock. This is based on the assumption that it will be able to ignore any intent above the local HLC clock's reading using observed timestamps and a local uncertainty limit. However, this argument only holds if we expect to be able to use a local uncertainty limit when we return to read the pushed intent. Notably, local uncertainty limits cannot be used to ignore intents with synthetic timestamps that would otherwise be in a reader's uncertainty interval, because observed timestamps do not apply to intents/values with synthetic timestamps. So if we know that we will be pushing an intent to a synthetic timestamp, we don't limit the value to a clock reading from the local clock.

Release note (bug fix): Read-write contention on GLOBAL tables no longer has a potential to thrash without making progress.

64017: sql: add test TestAlterRegionalByRowEnclosingRegionAddDrop r=arulajmani a=arulajmani

See individual commits for details. There are more test scenarios I want to add where we sandwich an ADD/DROP region inside of an ALTER to REGIONAL BY ROW or an operation on one of these tables, but here's a start. The error messages aren't wrapped yet, so they may appear a bit ugly in the jobs table. What this test ensures, which I think is quite valuable, is that the rollback is fine and no manual cleanup is required.

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: arulajmani <arulajmani@gmail.com>
This backports `TestCluster.MoveRangeLeaseNonCooperatively` and related test infra from cockroachdb#59086, avoiding backporting the code changes. Release note: None
Needed for #57688.
This PR reworks interactions between range leases and requests, pulling the consultation of a replica's lease down below the level of latching while keeping heavy-weight operations like lease acquisitions above the level of latching. Doing so comes with several benefits, some related specifically to non-blocking transactions and some more general.
Background
Before discussing the change here, let's discuss how lease checks, lease acquisitions, lease redirection, and lease transfers currently work. Today, requests consult a replica's range lease before acquiring latches. If the lease is good to go, the request proceeds to acquire latches. If the lease is not currently held by any replica, the lease is acquired (again, above latches) through a coalesced `RequestLeaseRequest`. If the lease is currently held by a different replica, the request is redirected to that replica using a `NotLeaseHolderError`. Finally, if the lease check notices a lease transfer in progress, the request is optimistically redirected to the prospective new leaseholder.

This all works, but only because it's been around for so long. Due to the lease check above latching, we're forced to go to great lengths to get the synchronization with in-flight requests right, which leads to very subtle logic. This is most apparent with lease transfers, which properly synchronize with ongoing requests through a delicate dance with the HLC clock and some serious "spooky action at a distance". Every request bumps the local HLC clock in `Store.Send`, then grabs the replica mutex, checks for an ongoing lease transfer, drops the replica mutex, and then evaluates. Lease transfers grab the replica mutex, grab a clock reading from the local HLC clock, bump the `minLeaseProposedTS` to stop using the current lease, drop the replica mutex, and then propose a new lease using this clock reading as its start time. This works only because each request bumps the HLC clock before checking the lease, so the HLC clock can serve as an upper bound on every request that has made it through the lease check by the time the lease transfer begins.

This structure is inflexible, subtle, and falls over as soon as we try to extend it.
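The clock-based guarantee above can be illustrated with a toy model (the `hlc` type and `transferStart` helper are illustrative, not the real `hlc.Clock` or lease transfer code):

```go
package main

import "fmt"

// hlc is a toy hybrid logical clock: a monotonic counter that forwards
// itself to every timestamp it observes.
type hlc struct{ wall int64 }

func (c *hlc) update(t int64) {
	if t > c.wall {
		c.wall = t
	}
}

func (c *hlc) now() int64 { return c.wall }

// transferStart returns the start time a lease transfer would use after the
// given request timestamps have each bumped the clock (as in Store.Send).
func transferStart(c *hlc, requestTS []int64) int64 {
	for _, ts := range requestTS {
		c.update(ts) // every request bumps the clock before the lease check
	}
	// The transfer grabs a clock reading and starts the new lease just
	// above it, so the new lease sits above every request served so far.
	return c.now() + 1
}

func main() {
	var c hlc
	fmt.Println(transferStart(&c, []int64{10, 17, 14})) // 18: above all served requests
	// A future-time (synthetic) request would NOT bump the clock, which is
	// exactly why this scheme breaks for non-blocking transactions.
}
```

The final comment is the crux of the Motivation section below: the guarantee only holds while every served request passes through the clock.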
Motivation
The primary motivation for pulling lease checks and transfers below latching is that the interaction between requests and lease transfers is incompatible with future-time operations, a key part of the non-blocking transaction project. This is because the structure relies on the HLC clock providing an upper bound on the time of any request served by an outgoing leaseholder, which is attached to lease transfers to ensure that the new leaseholder does not violate any request served on the old leaseholder. But this is quickly violated once we start serving future-time operations, which don't bump the HLC clock.
So we quickly need to look elsewhere for this information. The obvious place to look is the timestamp cache, which records the upper-bound read time of each key span in a range, even if this upper-bound time is synthetic. If we could scan the timestamp cache and attach the maximum read time to a lease transfer (through a new field, not as the lease start time), we'd be good. But this runs into a problem: if we just read the timestamp cache under the lease transfer's lock, we can't be sure we didn't miss any in-progress operations that had passed the lease check previously but had not yet bumped the timestamp cache. Maybe they are still reading? So the custom locking quickly runs into problems (I said it was inflexible!).
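The timestamp cache scan being described could look something like the following sketch. This is a simplified model, not the real `tscache.Cache`: keys are strings, timestamps are integers, and `maxReadTime` is a hypothetical helper.

```go
package main

import "fmt"

// entry is one span in a toy timestamp cache.
type entry struct {
	start, end string // key span [start, end)
	readTS     int64  // upper-bound read timestamp, possibly synthetic
}

// maxReadTime scans the cache for the highest timestamp read over any key
// overlapping [start, end), which a lease transfer could attach as the
// floor for the incoming leaseholder.
func maxReadTime(cache []entry, start, end string) int64 {
	var max int64
	for _, e := range cache {
		// Half-open interval overlap check.
		if e.start < end && start < e.end && e.readTS > max {
			max = e.readTS
		}
	}
	return max
}

func main() {
	cache := []entry{{"a", "c", 10}, {"c", "f", 25}, {"x", "z", 40}}
	fmt.Println(maxReadTime(cache, "a", "f")) // 25: the [x, z) span is outside [a, f)
}
```

The scan itself is easy; the hard part, as the paragraph above explains, is guaranteeing that every in-flight request has already bumped the cache before the scan runs, which is what pulling the transfer below latching provides.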
Solution
The solution here is to stop relying on custom locking for lease transfers by pulling the lease check below latching and by pulling the determination of the transfer's start time below latching. This ensures that during a lease transfer, we not only block new requests but also flush out in-flight requests. This means that by the time we look at the timestamp cache during the evaluation of a lease transfer, we know it has already been updated by any request that will be served under the current lease.
This commit doesn't make the switch from consulting the HLC clock to consulting the timestamp cache during `TransferLease` request evaluation, but a future commit will.
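The block-and-flush behavior of a range-wide latch can be modeled with a reader-writer lock. This is a toy sketch, not the real spanlatch manager; `transferFloor` and the integer timestamps are illustrative only.

```go
package main

import (
	"fmt"
	"sync"
)

// transferFloor models a lease transfer evaluated below latching: ordinary
// requests hold read latches (RLock) while evaluating and bump the
// timestamp cache; the transfer's range-wide write latch (Lock) blocks new
// requests and waits for in-flight ones, after which the timestamp cache
// high-water mark can be read safely as the transfer's floor.
func transferFloor(requestTS []int64) int64 {
	var latch sync.RWMutex // toy range-wide latch
	var mu sync.Mutex      // protects the toy timestamp cache
	var tsCacheMax int64
	var wg sync.WaitGroup

	for _, ts := range requestTS {
		wg.Add(1)
		go func(ts int64) {
			defer wg.Done()
			latch.RLock() // request latch held for the whole evaluation
			defer latch.RUnlock()
			mu.Lock()
			if ts > tsCacheMax {
				tsCacheMax = ts // bump the timestamp cache
			}
			mu.Unlock()
		}(ts)
	}
	wg.Wait() // for a deterministic demo; the write latch gives this ordering

	latch.Lock() // the transfer's write latch: drains all in-flight readers
	defer latch.Unlock()
	return tsCacheMax
}

func main() {
	fmt.Println(transferFloor([]int64{10, 25, 17})) // 25
}
```

The point of the model: once the write latch is held, no request served under the old lease can still be "in flight but unrecorded," so the cache read is complete — the property the custom locking scheme could not provide.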
Other benefits
Besides this primary change, a number of other benefits fall out of this restructuring.
- It creates a parallel between `TransferLeaseRequest` and `SubsumeRequest`, which now both grab clock readings during evaluation and will both need to forward their clock reading by the upper bound of a range's portion of the timestamp cache. It makes sense that these two requests would be very similar, as both are responsible for renouncing the current leaseholder's powers and passing them elsewhere.
- It aligns lease acquisition with the existing handling of `MergeInProgressError` by classifying a new `InvalidLeaseError` as a "concurrencyRetryError" (see `isConcurrencyRetryError`). This fits the existing structure of: grab latches, check range state, drop latches and wait if necessary, retry.
- It folds the lease check into `checkExecutionCanProceed`, so we grab the replica read lock one fewer time in the request path.

Other behavioral changes
There are two auxiliary behavioral changes made by this commit that deserve attention.
The first is that during a lease transfer, operations now block on the outgoing leaseholder instead of immediately redirecting to the expected next leaseholder. This has trade-offs. On one hand, this delays redirection, which may make lease transfers more disruptive to ongoing traffic. On the other, we've seen in the past that the optimistic redirection is not an absolute win. In many cases, it can lead to thrashing and lots of wasted work, as the outgoing leaseholder and the incoming leaseholder both point at each other and requests ping-pong between them. We've seen this cause serious issues like #22837 and #32367, which we addressed by adding exponential backoff in the client in 89d349a. So while this change may make average-case latency during lease transfers slightly worse, it will keep things much more orderly, avoid wasted work, and reduce worst-case latency during lease transfers.
The other behavioral change made by this commit is that observed timestamps are no longer applied to a request to reduce its `MaxOffset` until after latching and locking, instead of before. This sounds concerning, but it's actually not, for two reasons. First, as of #57136, a transaction's uncertainty interval is no longer considered by the lock table because locks in a transaction's uncertainty interval are no longer considered write-read conflicts. Instead, those locks' provisional values are considered at evaluation time to be uncertain. Second, the fact that the observed-timestamp-limited `MaxOffset` was being used for latching is no longer correct in a world with synthetic timestamps (see #57077), so we would have had to make this change anyway. Put together, this behavioral change isn't meaningful.
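To make the observed-timestamp interaction concrete, here is a toy model of how a reader's uncertainty interval is trimmed, and why synthetic values escape the trim. The helper names and integer timestamps are illustrative, not the real CockroachDB types.

```go
package main

import "fmt"

// localUncertaintyLimit models trimming a reader's uncertainty interval on
// a given node: the global limit (read timestamp + max clock offset) is
// lowered to the node's observed timestamp, but never below the read
// timestamp itself.
func localUncertaintyLimit(readTS, maxOffset, observedTS int64) int64 {
	limit := readTS + maxOffset
	if observedTS < limit {
		limit = observedTS
	}
	if limit < readTS {
		limit = readTS
	}
	return limit
}

// isUncertain reports whether a value forces an uncertainty restart.
// Observed timestamps do not apply to synthetic values, so those are judged
// against the untrimmed global limit.
func isUncertain(valTS int64, synthetic bool, readTS, localLimit, globalLimit int64) bool {
	if valTS <= readTS {
		return false // at or below the read timestamp: simply visible
	}
	if synthetic {
		return valTS <= globalLimit
	}
	return valTS <= localLimit
}

func main() {
	readTS, maxOffset, observedTS := int64(100), int64(50), int64(110)
	local := localUncertaintyLimit(readTS, maxOffset, observedTS) // 110
	global := readTS + maxOffset                                  // 150
	fmt.Println(isUncertain(120, false, readTS, local, global)) // false: trimmed away
	fmt.Println(isUncertain(120, true, readTS, local, global))  // true: synthetic, no trim
}
```

In this model, moving the `localUncertaintyLimit` computation from before latching to after it changes nothing about which values are uncertain; it only changes where in the request path the trim happens, which is why the behavioral change is benign.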