rangefeed: don't advance resolved ts on txn abort #111218
aliher1911 wants to merge 1 commit into cockroachdb:master
Conversation
Previously, when the rangefeed received an ABORTED status for a pushed transaction, it would eagerly drop all known intents belonging to that transaction from its cache. This was proven incorrect: we can get an ABORTED status even when the transaction is committed, if the leaseholder has already removed the transaction record. This commit removes the shortcut and always relies on intent operations to advance the resolved timestamp. Epic: none Release note: None
erikgrinaker
left a comment
This makes sense as an immediate fix for the incorrect rangefeed RTS advance, and the effect should be no worse than that of committed transactions which will hold up the RTS until the intents are resolved.
However, it seems generally problematic that SynthesizeTxnFromMeta returns ABORTED for committed transactions. Is this something the txns group should look into @nvanbenschoten?
We should backport this after some baking time as well.
Reviewed 8 of 9 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @itsbilal, and @nvanbenschoten)
-- commits line 15 at r1:
This deserves a release note, since it fixes a potential correctness bug with changefeeds.
pkg/kv/kvserver/rangefeed/task.go line 326 at r1 (raw file):
ops := make([]enginepb.MVCCLogicalOp, len(pushedTxns))
var intentsToCleanup []roachpb.LockUpdate
pushed := 0
nit: we can make ops zero-length (with len(pushedTxns) capacity) and append to it, rather than tracking pushed separately.
pkg/kv/kvserver/rangefeed/task_test.go line 527 at r1 (raw file):
updateIntentOp(txn2, hlc.Timestamp{WallTime: 2}),
// abortTxnOp(txn3),
// abortTxnOp(txn4),
nit: remove these?
erikgrinaker
left a comment
PS: I'd like an approval from KV transactions as well, either @nvanbenschoten or @arulajmani.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @itsbilal, and @nvanbenschoten)
nvb
left a comment
I do not think this PR is correct as currently written. See my comment below about transaction records for ABORTED transactions that do not carry any lock spans. In those cases, no intent resolution is on its way, so we (the rangefeed processor) need to do something about these intents. But we are also not told where the intents are in the push response.
As written, I believe resolved timestamps are going to get stuck for abandoned transactions which have some written intents (e.g. after a gateway crash).
However, it seems generally problematic that SynthesizeTxnFromMeta returns ABORTED for committed transactions. Is this something the txns group should look into @nvanbenschoten?
I don't know of other cases where this is more generally problematic. Do you? Recall that a transaction can only get into this state once all of its intents have been resolved. After that state, we need some eventual way to GC its state, which requires losing information.
The problem is that a rangefeed processor may be operating on a stale view of the range, where a completed intent resolution may not have yet propagated. For operations like follower reads, this works out fine because the operation routes back through the leaseholder in these cases, which detects the stale case and handles it correctly. Rangefeed is unique in operating on stale intent information but then not validating that information with the leaseholder before making decisions based on it. See my comment below about how we could correct this by waiting for the follower to catch up to (through application) the leaseholder and avoid this problem.
Reviewed 1 of 9 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @itsbilal)
-- commits line 9 at r1:
The other condition is that every intent has already been resolved. The problem raised in #104309 is that an intent may have been resolved on the leaseholder of its range, but not on the follower that is serving the rangefeed.
Spelling this out reveals a second solution, which is that we could keep the same behavior as we have today if, after finding an ABORTED transaction record status, we waited for the follower of the rangefeed range to catch up to the leaseholder. Something like a WaitForApplication call would be enough to achieve this.
After waiting for the follower to catch up, we could then perform the same MVCCAbortTxnOp to handle the case of an abandoned transaction.
pkg/kv/kvserver/rangefeed/task.go line 368 at r1 (raw file):
However, if the txn happens to have its LockSpans populated, then lets
clean up the intents within the processor's range as an optimization to
help others and to prevent any rangefeed reconnections from needing to
push the same txn. If we aborted the txn, then it won't have its
LockSpans populated. If, however, we ran into a transaction that its
coordinator tried to rollback but didn't follow up with garbage
collection, then LockSpans will be populated.
This paragraph was important, because it reveals the possibility that a transaction record is aborted without carrying any lock spans. In that case, no intent resolution is coming and the transaction record isn't telling us where the intents are located. So either we need to find them through a lock table scan, or we need to stop relying on reference counting and forget about the transaction through a mechanism like MVCCAbortTxnOp.
As currently written, I think this will lead to resolved timestamp stalls in cases like gateway node crashes.
erikgrinaker
left a comment
Nice catch Nathan, thanks. We should try to construct a test case for this scenario.
I don't know of other cases where this is more generally problematic. Do you?
I don't know of a concrete problem, no, but this seems like an accident waiting to happen. Ideally, we would synthesize the meta as COMMITTED rather than ABORTED, by addressing this TODO:
cockroach/pkg/kv/kvserver/replica_tscache.go
Lines 536 to 542 in 77640e5
Reviewed 1 of 9 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @itsbilal, and @nvanbenschoten)
we could keep the same behavior as we have today if, after finding an ABORTED transaction record status, we waited for the follower of the rangefeed range to catch up to the leaseholder
I think I'm partial to this solution. The other option, a lock table scan and intent resolution, can be expensive in the case of many intents. If the follower is lagging far behind the leaseholder, then the resolved timestamp will already be lagging far behind the current time, so waiting out the replication and application seems less problematic.
Can we limit this to the case where we are not the leaseholder and the aborted txn record has no lock spans?
Something like a WaitForApplication call would be enough to achieve this.
It would, but it's a bit annoying since we need to know which LAI to wait for, which we'll have to get from the leaseholder. And once we have that, we don't really need to go via a WaitForApplication RPC to probe the replica state, we can just wait for the local replica to reach it inline (if we're ok with blocking here) -- although we may not have easy access to the replica here, so maybe WaitForApplication is more straightforward.
We can't know which timestamp the txn would have committed at, can we? If we did, then we could also check if that's already below the closed timestamp, and omit the replication roundtrip.
nvb
left a comment
Ideally, we would synthesize the meta as COMMITTED rather than ABORTED, by addressing this TODO
That TODO could help in some cases, but it's not an exhaustive solution. If we're using the timestamp cache (or any in-memory, unreplicated data structure) to store information about committed transactions then there will be cases where this information is lost. Similarly, if we're storing this information in persistent storage then we eventually need to GC it to avoid unbounded build-up. So there doesn't seem to be an approach that avoids losing information and needing to handle these ambiguous cases.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @itsbilal)
Can we limit this to the case where we are not the leaseholder and the aborted txn record has no lock spans?
I don't think we want to guarantee that if a txn record has some lock spans, that it has every lock span. You could imagine a world where long-running transactions would checkpoint their lock spans periodically to bound the number of locks that they abandon without tracking on failure.
It would, but it's a bit annoying since we need to know which LAI to wait for, which we'll have to get from the leaseholder. And once we have that, we don't really need to go via a WaitForApplication RPC to probe the replica state, we can just wait for the local replica to reach it inline (if we're ok with blocking here) -- although we may not have easy access to the replica here, so maybe WaitForApplication is more straightforward.
Agreed. If we can locate the leaseholder and grab its LAI, we should have all we need to perform a replica-local wait.
Pushing a write through Raft and waiting for it to come out the other end locally would avoid the clock synchronization dependency, but that case is probably not worth considering. I suspect that we already have a dependency on clocks from the PushTxn.
We can't know which timestamp the txn would have committed at, can we? If we did, then we could also check if that's already below the closed timestamp, and omit the replication roundtrip.
I don't understand this idea, but it sounds interesting. Mind spelling it out further?
erikgrinaker
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @itsbilal, and @nvanbenschoten)
I don't understand this idea, but it sounds interesting. Mind spelling it out further?
This was pretty half-baked. The idea was that if we know the commit timestamp, and the closed timestamp has advanced past the commit timestamp, then we know that we've seen all intents written by the transaction, so we can flush them and advance the resolved timestamp.
erikgrinaker
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @itsbilal, and @nvanbenschoten)
If we can locate the leaseholder and grab its LAI, we should have all we need to perform a replica-local wait.
FWIW, I don't think we have a way to do this currently, but we can consider piggybacking on the LeaseInfo RPC by adding a field for the LAI. This complicates backports a little, in that we have to handle leaseholders that don't support it, but that seems ok.
cockroach/pkg/kv/kvpb/api.proto
Lines 1631 to 1650 in 4518a99
What if we have a way to indicate that the aborted transaction record was synthesized? That is the only special case we need to handle; "real" aborted transactions we can just ignore, as we used to.
nvb
left a comment
We actually do expose this through the TxnRecordExists flag on the QueryTxnResponse. It would be reasonable to add the same flag on PushTxnResponse.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @itsbilal)
When we synthesise the transaction record from metadata in the ABORTED case, if we have the oldest write timestamp for the transaction then we don't need to use the LAI and wait for it specifically. We receive closed timestamp notifications in the processor, so we can just drop aborted transactions once the processor reaches that timestamp.
The problem is that the txn meta the pusher sees can be outdated if the txn was anchored on a different range and has since been pushed by someone else. So if the rangefeed pusher is looking at an old intent at t1, but the transaction instead committed and cleaned up intents at t2, then pushing it at t1 and waiting for the closed timestamp to reach t1 will remove the intents from tracking, allowing it to emit a closed timestamp for t3 > t2 before emitting the values at t2.
I was thinking that when we push, it is the replica that owns the transaction anchor that synthesised the transaction record for the pusher. So it must know the oldest possible timestamp the transaction could have written at. Even if that's a different range, once "our" leaseholder signals that its closed timestamp is past the transaction's timestamp, it should be safe to assume we won't receive intents with earlier timestamps.
But how do we know what that timestamp is? The transaction record is gone, and I don't believe we keep track of the latest write timestamp (although it would be useful to have). We could of course use the replica's current time, but we're incurring a ~3 second wait in that case.
Isn't it the same if we want to use
No, with the LAI we'll typically not wait more than a replication roundtrip (if that), which is significantly less than 3 seconds. And I don't think we should use
Fair point. I was referring to
Replaced by #114487.
Previously, when the rangefeed received an ABORTED status for a pushed transaction, it would eagerly drop all known intents belonging to that transaction from its cache.
This was proven incorrect: we can get an ABORTED status even when the transaction is committed, if the leaseholder has already removed the transaction record.
This commit removes the shortcut and always relies on intent operations to advance the resolved timestamp.
This is largely a revert of #35889
Epic: none
Fixes: #104309
Release note: None