rangefeed: don't advance resolved ts on txn abort #111218
aliher1911 wants to merge 1 commit into cockroachdb:master
Conversation
Previously, when the rangefeed received an ABORTED status for a pushed transaction, it would eagerly drop all known intents belonging to that transaction from its cache. This was proven incorrect: we can get an ABORTED status even when the transaction is committed, if the leaseholder has already removed the transaction record. This commit removes the shortcut and always relies on intent operations to advance the resolved timestamp. Epic: none Release note: None
erikgrinaker
left a comment
This makes sense as an immediate fix for the incorrect rangefeed RTS advance, and the effect should be no worse than that of committed transactions which will hold up the RTS until the intents are resolved.
However, it seems generally problematic that SynthesizeTxnFromMeta returns ABORTED for committed transactions. Is this something the txns group should look into @nvanbenschoten?
We should backport this after some baking time as well.
Reviewed 8 of 9 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @itsbilal, and @nvanbenschoten)
-- commits line 15 at r1:
This deserves a release note, since it fixes a potential correctness bug with changefeeds.
pkg/kv/kvserver/rangefeed/task.go line 326 at r1 (raw file):
ops := make([]enginepb.MVCCLogicalOp, len(pushedTxns))
var intentsToCleanup []roachpb.LockUpdate
pushed := 0
nit: we can make ops zero-length (with len(pushedTxns) capacity) and append to it, rather than tracking pushed separately.
pkg/kv/kvserver/rangefeed/task_test.go line 527 at r1 (raw file):
updateIntentOp(txn2, hlc.Timestamp{WallTime: 2}),
// abortTxnOp(txn3),
// abortTxnOp(txn4),
nit: remove these?
erikgrinaker
left a comment
PS: I'd like an approval from KV transactions as well, either @nvanbenschoten or @arulajmani.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @itsbilal, and @nvanbenschoten)
nvb
left a comment
I do not think this PR is correct as currently written. See my comment below about transaction records for ABORTED transactions that do not carry any lock spans. In those cases, no intent resolution is on its way, so we (the rangefeed processor) need to do something about these intents. But we are also not told where the intents are in the push response.
As written, I believe resolved timestamps are going to get stuck for abandoned transactions which have some written intents (e.g. after a gateway crash).
However, it seems generally problematic that SynthesizeTxnFromMeta returns ABORTED for committed transactions. Is this something the txns group should look into @nvanbenschoten?
I don't know of other cases where this is more generally problematic. Do you? Recall that a transaction can only get into this state once all of its intents have been resolved. After that state, we need some eventual way to GC its state, which requires losing information.
The problem is that a rangefeed processor may be operating on a stale view of the range, where a completed intent resolution may not have yet propagated. For operations like follower reads, this works out fine because the operation routes back through the leaseholder in these cases, which detects the stale case and handles it correctly. Rangefeed is unique in operating on stale intent information but then not validating that information with the leaseholder before making decisions based on it. See my comment below about how we could correct this by waiting for the follower to catch up to (through application) the leaseholder and avoid this problem.
Reviewed 1 of 9 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @itsbilal)
-- commits line 9 at r1:
The other condition is that every intent has already been resolved. The problem raised in #104309 is that an intent may have been resolved on the leaseholder of its range, but not on the follower that is serving the rangefeed.
Spelling this out reveals a second solution, which is that we could keep the same behavior as we have today if, after finding an ABORTED transaction record status, we waited for the follower of the rangefeed range to catch up to the leaseholder. Something like a WaitForApplication call would be enough to achieve this.
After waiting for the follower to catch up, we could then perform the same MVCCAbortTxnOp to handle the case of an abandoned transaction.
pkg/kv/kvserver/rangefeed/task.go line 368 at r1 (raw file):
However, if the txn happens to have its LockSpans populated, then lets
clean up the intents within the processor's range as an optimization to
help others and to prevent any rangefeed reconnections from needing to
push the same txn. If we aborted the txn, then it won't have its
LockSpans populated. If, however, we ran into a transaction that its
coordinator tried to rollback but didn't follow up with garbage
collection, then LockSpans will be populated.
This paragraph was important, because it reveals the possibility that a transaction record is aborted without carrying any lock spans. In that case, no intent resolution is coming and the transaction record isn't telling us where the intents are located. So either we need to find them through a lock table scan, or we need to stop relying on reference counting and forget about the transaction through a mechanism like MVCCAbortTxnOp.
As currently written, I think this will lead to resolved timestamp stalls in cases like gateway node crashes.
erikgrinaker
left a comment
Nice catch Nathan, thanks. We should try to construct a test case for this scenario.
I don't know of other cases where this is more generally problematic. Do you?
I don't know of a concrete problem, no, but this seems like an accident waiting to happen. Ideally, we would synthesize the meta as COMMITTED rather than ABORTED, by addressing this TODO:
cockroach/pkg/kv/kvserver/replica_tscache.go
Lines 536 to 542 in 77640e5
Reviewed 1 of 9 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @itsbilal, and @nvanbenschoten)
we could keep the same behavior as we have today if, after finding an ABORTED transaction record status, we waited for the follower of the rangefeed range to catch up to the leaseholder
I think I'm partial to this solution. The other option, a lock table scan and intent resolution, can be expensive in the case of many intents. If the follower is lagging far behind the leaseholder, then the resolved timestamp will already be lagging far behind the current time, so waiting out the replication and application seems less problematic.
Can we limit this to the case where we are not the leaseholder and the aborted txn record has no lock spans?
Something like a WaitForApplication call would be enough to achieve this.
It would, but it's a bit annoying since we need to know which LAI to wait for, which we'll have to get from the leaseholder. And once we have that, we don't really need to go via a WaitForApplication RPC to probe the replica state, we can just wait for the local replica to reach it inline (if we're ok with blocking here) -- although we may not have easy access to the replica here, so maybe WaitForApplication is more straightforward.
We can't know which timestamp the txn would have committed at, can we? If we did, then we could also check if that's already below the closed timestamp, and omit the replication roundtrip.
nvb
left a comment
Ideally, we would synthesize the meta as COMMITTED rather than ABORTED, by addressing this TODO
That TODO could help in some cases, but it's not an exhaustive solution. If we're using the timestamp cache (or any in-memory, unreplicated data structure) to store information about committed transactions then there will be cases where this information is lost. Similarly, if we're storing this information in persistent storage then we eventually need to GC it to avoid unbounded build-up. So there doesn't seem to be an approach that avoids losing information and needing to handle these ambiguous cases.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @itsbilal)
Can we limit this to the case where we are not the leaseholder and the aborted txn record has no lock spans?
I don't think we want to guarantee that if a txn record has some lock spans, that it has every lock span. You could imagine a world where long-running transactions would checkpoint their lock spans periodically to bound the number of locks that they abandon without tracking on failure.
It would, but it's a bit annoying since we need to know which LAI to wait for, which we'll have to get from the leaseholder. And once we have that, we don't really need to go via a WaitForApplication RPC to probe the replica state, we can just wait for the local replica to reach it inline (if we're ok with blocking here) -- although we may not have easy access to the replica here, so maybe WaitForApplication is more straightforward.
Agreed. If we can locate the leaseholder and grab its LAI, we should have all we need to perform a replica-local wait.
Pushing a write through Raft and waiting for it to come out the other end locally would avoid the clock synchronization dependency, but that case is probably not worth considering. I suspect that we already have a dependency on clocks from the PushTxn.
We can't know which timestamp the txn would have committed at, can we? If we did, then we could also check if that's already below the closed timestamp, and omit the replication roundtrip.
I don't understand this idea, but it sounds interesting. Mind spelling it out further?
erikgrinaker
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @itsbilal, and @nvanbenschoten)
I don't understand this idea, but it sounds interesting. Mind spelling it out further?
This was pretty half-baked. The idea was that if we know the commit timestamp, and the closed timestamp has advanced past the commit timestamp, then we know that we've seen all intents written by the transaction, so we can flush them and advance the resolved timestamp.
erikgrinaker
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @itsbilal, and @nvanbenschoten)
If we can locate the leaseholder and grab its LAI, we should have all we need to perform a replica-local wait.
FWIW, I don't think we have a way to do this currently, but we can consider piggybacking on the LeaseInfo RPC by adding a field for the LAI. This complicates backports a little, in that we have to handle leaseholders that don't support it, but that seems ok.
cockroach/pkg/kv/kvpb/api.proto
Lines 1631 to 1650 in 4518a99
What if we have a way to indicate that the aborted transaction record was synthesized? That is the only special case we need to handle; "real" aborted transactions we can just ignore, as we used to.
nvb
left a comment
We actually do expose this through the TxnRecordExists flag on the QueryTxnResponse. It would be reasonable to add the same flag on PushTxnResponse.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @itsbilal)
When we synthesise the transaction record from metadata in the ABORTED case, if we have the oldest write timestamp for the transaction then we don't need to use the LAI and wait for it specifically. We receive closed timestamp notifications in the processor, so we can just drop aborted transactions once the processor reaches that timestamp.
The problem is that the txn meta the pusher sees can be outdated if the txn was anchored on a different range and has since been pushed by someone else. So if the rangefeed pusher is looking at an old intent at t1, but the transaction instead committed and cleaned up intents at t2, then pushing it at t1 and waiting for the closed timestamp to reach t1 will remove the intents from tracking, allowing it to emit a closed timestamp for t3 > t2 before emitting the values at t2.
I was thinking that when we push, it is the replica that owns the transaction anchor that synthesised the transaction record for the pusher. So it must know the oldest possible timestamp the transaction could have written at. Even if that's a different range, once "our" leaseholder signals that its closed timestamp is past the transaction's timestamp, it should be safe to assume we won't receive intents with earlier timestamps.
But how do we know what that timestamp is? The transaction record is gone, and I don't believe we keep track of the latest write timestamp (although it would be useful to have). We could of course use the replica's current time, but we're incurring a ~3 second wait in that case.
Isn't it the same if we want to use
No, with the LAI we'll typically not wait more than a replication roundtrip (if that), which is significantly less than 3 seconds. And I don't think we should use
Fair point. I was referring to
Replaced by #114487.
Previously, when the rangefeed received an ABORTED status for a pushed transaction, it would eagerly drop all known intents belonging to that transaction from its cache.
This was proven incorrect: we can get an ABORTED status even when the transaction is committed, if the leaseholder has already removed the transaction record.
This commit removes the shortcut and always relies on intent operations to advance the resolved timestamp.
This is largely a revert of #35889
Epic: none
Fixes: #104309
Release note: None