rangefeed: don't discard ref count for txn after intent abort#35777
Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom Mar 15, 2019
Merged
rangefeed: don't discard ref count for txn after intent abort#35777craig[bot] merged 1 commit intocockroachdb:masterfrom
craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
Member
f762004 to
48e8d05
Compare
tbg
approved these changes
Mar 15, 2019
Member
tbg
left a comment
There was a problem hiding this comment.
Reviewed 2 of 2 files at r1.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @danhhz)
danhhz
approved these changes
Mar 15, 2019
Contributor
danhhz
left a comment
There was a problem hiding this comment.
Great find! I'm curious, any theories why this would have become more frequent recently?
Contributor
Author
An increase in transaction retries is the only thing I can think of. #34600 was failing so reliably for other reasons though. Do you think it is safe to say that this became more frequent recently or was it just hidden? bors r+ |
Contributor
Merge conflict (retrying...) |
Contributor
Build failed |
Contributor
Author
|
Flake due to #34081. bors r+ |
Contributor
Merge conflict (retrying...) |
Contributor
Build failed |
Contributor
Author
|
Flaked due to #35484. bors r+ |
Contributor
Build failed |
Fixes cockroachdb#34600. Before this commit, rangefeed made the assumption that an aborted intent implied an aborted transaction (after the rangefeed was in a steady-state). It used this assumption to eagerly discard the bookkeeping for a transaction after its first MVCCAbortIntent operation was seen as an optimization. This was incorrect. A transaction can "abort" an intent even if it commits. This is possible if the intent was only written at a previous epoch and not written in the epoch that ended up finalizing the transaction. This commit removes this assumption, which I confirmed fixes the resolved timestamp regression using the new assertion from cockroachdb#35772. I believe this could have two effects: - it could trigger the assertion failure from cockroachdb#34600 if the transaction experiencing the incorrect assumption was the oldest being tracked at the time of the assumption. - it could cause resolved timestamp stalls and prevent all future forward progress of resolved timestamps if the transaction experiencing the incorrect assumption was not the oldest being tracked at the time of the assumption. Release note: None
48e8d05 to
f19dd96
Compare
Contributor
Author
|
Same issue. I'm debugging now in #35549. bors r+ |
craig bot
pushed a commit
that referenced
this pull request
Mar 15, 2019
35777: rangefeed: don't discard ref count for txn after intent abort r=nvanbenschoten a=nvanbenschoten Fixes #34600. Before this commit, rangefeed made the assumption that an aborted intent implied an aborted transaction (after the rangefeed was in a steady-state). It used this assumption to eagerly discard the bookkeeping for a transaction after its first MVCCAbortIntent operation was seen as an optimization. This was incorrect. A transaction can "abort" an intent even if it commits. This is possible if the intent was only written at a previous epoch and not written in the epoch that ended up finalizing the transaction. This commit removes this assumption, which I confirmed fixes the resolved timestamp regression using the new assertion from #35772. I believe this could have two effects: - it could trigger the assertion failure from #34600 if the transaction experiencing the incorrect assumption was the oldest being tracked at the time of the assumption. - it could cause resolved timestamp stalls and prevent all future forward progress of resolved timestamps if the transaction experiencing the incorrect assumption was not the oldest being tracked at the time of the assumption. Release note: None Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Contributor
Build succeeded |
nvb
added a commit
to nvb/cockroach
that referenced
this pull request
Mar 18, 2019
Fixes cockroachdb#35816. The fix in cockroachdb#35777 inadvertently broke `txnPushAttempt` and its interaction with transactions that were found to be aborted. This commit restores this handling by reverting part of the assertion added in cockroachdb#35772 and splitting up the handling of "aborted intents" from the handling of "aborted transactions". Aborted transactions were always one of the most tricky situations to handle in rangefeed. The challenge is that rangefeed may be tracking a transaction that has been abandoned and may want to push it to unblock its resolved timestamp, but it doesn't have a way to know where the aborted transaction's intents live on its range. The aborted transaction record can't tell us this either. So the only thing rangefeed can do is push the transaction and convince itself that the aborted intents on its range can be safely ignored. This poses a problem for accurate reference counting of intents though. We need to be able to discard a transaction and still handle the eventual resolution of its aborted intents. To handle this properly, the change splits up `MVCCAbortIntentOp` into `MVCCAbortIntentOp` and `MVCCAbortTxnOp`. Before this change, the single operation type was attempting to fill both roles, but they mean fundamentally different things. `MVCCAbortTxnOp` is then given back its ability to delete transactions from the queue directly without worrying about their reference count. Next, the `unresolvedIntentQueue` is extended to be a little more flexible. Instead of asserting that reference counts never drop below zero, it just ignores changes that would make reference counts negative. This has the potential to "leak" reference counts that saturate at zero and are then incremented again, but we know that this will only ever happen for aborted transactions, which can always be pushed again to rediscover that they have been aborted. Release note: None
nvb
added a commit
to nvb/cockroach
that referenced
this pull request
Mar 19, 2019
Fixes cockroachdb#35816. The fix in cockroachdb#35777 inadvertently broke `txnPushAttempt` and its interaction with transactions that were found to be aborted. This commit restores this handling by reverting part of the assertion added in cockroachdb#35772 and splitting up the handling of "aborted intents" from the handling of "aborted transactions". Aborted transactions were always one of the most tricky situations to handle in rangefeed. The challenge is that rangefeed may be tracking a transaction that has been abandoned and may want to push it to unblock its resolved timestamp, but it doesn't have a way to know where the aborted transaction's intents live on its range. The aborted transaction record can't tell us this either. So the only thing rangefeed can do is push the transaction and convince itself that the aborted intents on its range can be safely ignored. This poses a problem for accurate reference counting of intents though. We need to be able to discard a transaction and still handle the eventual resolution of its aborted intents. To handle this properly, the change splits up `MVCCAbortIntentOp` into `MVCCAbortIntentOp` and `MVCCAbortTxnOp`. Before this change, the single operation type was attempting to fill both roles, but they mean fundamentally different things. `MVCCAbortTxnOp` is then given back its ability to delete transactions from the queue directly without worrying about their reference count. Next, the `unresolvedIntentQueue` is extended to be a little more flexible. Instead of asserting that reference counts never drop below zero, it just ignores changes that would make reference counts negative. This has the potential to "leak" reference counts that saturate at zero and are then incremented again, but we know that this will only ever happen for aborted transactions, which can always be pushed again to rediscover that they have been aborted. Release note: None
nvb
added a commit
to nvb/cockroach
that referenced
this pull request
Mar 19, 2019
Fixes cockroachdb#35816. The fix in cockroachdb#35777 inadvertently broke `txnPushAttempt` and its interaction with transactions that were found to be aborted. This commit restores this handling by reverting part of the assertion added in cockroachdb#35772 and splitting up the handling of "aborted intents" from the handling of "aborted transactions". Aborted transactions were always one of the most tricky situations to handle in rangefeed. The challenge is that rangefeed may be tracking a transaction that has been abandoned and may want to push it to unblock its resolved timestamp, but it doesn't have a way to know where the aborted transaction's intents live on its range. The aborted transaction record can't tell us this either. So the only thing rangefeed can do is push the transaction and convince itself that the aborted intents on its range can be safely ignored. This poses a problem for accurate reference counting of intents though. We need to be able to discard a transaction and still handle the eventual resolution of its aborted intents. To handle this properly, the change splits up `MVCCAbortIntentOp` into `MVCCAbortIntentOp` and `MVCCAbortTxnOp`. Before this change, the single operation type was attempting to fill both roles, but they mean fundamentally different things. `MVCCAbortTxnOp` is then given back its ability to delete transactions from the queue directly without worrying about their reference count. Next, the `unresolvedIntentQueue` is extended to be a little more flexible. Instead of asserting that reference counts never drop below zero, it just ignores changes that would make reference counts negative. This has the potential to "leak" reference counts that saturate at zero and are then incremented again, but we know that this will only ever happen for aborted transactions, which can always be pushed again to rediscover that they have been aborted. Release note: None
craig bot
pushed a commit
that referenced
this pull request
Mar 19, 2019
35889: rangefeed: fix handling of aborted transactions found by txnPushAttempt r=nvanbenschoten a=nvanbenschoten Fixes #35816. The fix in #35777 inadvertently broke `txnPushAttempt` and its interaction with transactions that were found to be aborted. This commit restores this handling by reverting part of the assertion added in #35772 and splitting up the handling of "aborted intents" from the handling of "aborted transactions". Aborted transactions were always one of the most tricky situations to handle in rangefeed. The challenge is that rangefeed may be tracking a transaction that has been abandoned and may want to push it to unblock its resolved timestamp, but it doesn't have a way to know where the aborted transaction's intents live on its range. The aborted transaction record can't tell us this either. So the only thing rangefeed can do is push the transaction and convince itself that the aborted intents on its range can be safely ignored. This poses a problem for accurate reference counting of intents though. We need to be able to discard a transaction and still handle the eventual resolution of its aborted intents. To handle this properly, the change splits up `MVCCAbortIntentOp` into `MVCCAbortIntentOp` and `MVCCAbortTxnOp`. Before this change, the single operation type was attempting to fill both roles, but they mean fundamentally different things. `MVCCAbortTxnOp` is then given back its ability to delete transactions from the queue directly without worrying about their reference count. Next, the `unresolvedIntentQueue` is extended to be a little more flexible. Instead of asserting that reference counts never drop below zero, it just ignores changes that would make reference counts negative. This has the potential to "leak" reference counts that saturate at zero and are then incremented again, but we know that this will only ever happen for aborted transactions, which can always be pushed again to rediscover that they have been aborted. Release note: None Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
nvb
added a commit
to nvb/cockroach
that referenced
this pull request
Mar 21, 2019
Fixes cockroachdb#35816. The fix in cockroachdb#35777 inadvertently broke `txnPushAttempt` and its interaction with transactions that were found to be aborted. This commit restores this handling by reverting part of the assertion added in cockroachdb#35772 and splitting up the handling of "aborted intents" from the handling of "aborted transactions". Aborted transactions were always one of the most tricky situations to handle in rangefeed. The challenge is that rangefeed may be tracking a transaction that has been abandoned and may want to push it to unblock its resolved timestamp, but it doesn't have a way to know where the aborted transaction's intents live on its range. The aborted transaction record can't tell us this either. So the only thing rangefeed can do is push the transaction and convince itself that the aborted intents on its range can be safely ignored. This poses a problem for accurate reference counting of intents though. We need to be able to discard a transaction and still handle the eventual resolution of its aborted intents. To handle this properly, the change splits up `MVCCAbortIntentOp` into `MVCCAbortIntentOp` and `MVCCAbortTxnOp`. Before this change, the single operation type was attempting to fill both roles, but they mean fundamentally different things. `MVCCAbortTxnOp` is then given back its ability to delete transactions from the queue directly without worrying about their reference count. Next, the `unresolvedIntentQueue` is extended to be a little more flexible. Instead of asserting that reference counts never drop below zero, it just ignores changes that would make reference counts negative. This has the potential to "leak" reference counts that saturate at zero and are then incremented again, but we know that this will only ever happen for aborted transactions, which can always be pushed again to rediscover that they have been aborted. Release note: None
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #34600.
Before this commit, rangefeed made the assumption that an aborted intent
implied an aborted transaction (after the rangefeed was in a steady-state).
It used this assumption to eagerly discard the bookkeeping for a transaction
after its first MVCCAbortIntent operation was seen as an optimization.
This was incorrect. A transaction can "abort" an intent even if it commits.
This is possible if the intent was only written at a previous epoch and not
written in the epoch that ended up finalizing the transaction. This commit
removes this assumption, which I confirmed fixes the resolved timestamp
regression using the new assertion from #35772.
I believe this could have two effects:
experiencing the incorrect assumption was the oldest being tracked at
the time of the assumption.
progress of resolved timestamps if the transaction experiencing the incorrect
assumption was not the oldest being tracked at the time of the assumption.
Release note: None