Skip to content

storage: don't perform one-phase commit transactions after restarts#40518

Merged
craig[bot] merged 5 commits intocockroachdb:masterfrom
nvb:nvanbenschoten/1pcEpoch
Sep 6, 2019
Merged

storage: don't perform one-phase commit transactions after restarts#40518
craig[bot] merged 5 commits intocockroachdb:masterfrom
nvb:nvanbenschoten/1pcEpoch

Conversation

@nvb
Copy link
Copy Markdown
Contributor

@nvb nvb commented Sep 5, 2019

Fixes #40466.

This commit adjusts the Replica-side logic around executing one-phase commit transactions. It prevents batches that could be executed on the 1PC fast-path from doing so if they contain transactions in any epoch other than their first epoch. We saw in #40466 that this could lead to intent abandonment if the txn had written intents in earlier epochs.

I believe that this is a new issue that was introduced when we moved from using the presence of a BeginTxn and EndTxn in the same batch to detect 1PC transactions to using the sequence numbers of writes and the presence of an EndTxn in a batch to detect 1PC transactions. That change was necessary for parallel commits. Before that change, I think logic in DistSender was preventing us from hitting this case because it would always split EndTxn requests from the rest of its batch when a txn had been restarted:

// To ensure that we lay down intents to prevent starvation, always
// split the end transaction request into its own batch on retries.
// Txns requiring 1PC are an exception and should never be split.
if ba.Txn != nil && ba.Txn.Epoch > 0 && !require1PC {
splitET = true
}

nvb added 3 commits September 5, 2019 12:02
This was no longer in use, as of 374413f.

Release note: None
Fixes cockroachdb#40466.

This commit adjusts the Replica-side logic around executing one-phase commit
transactions. It prevents batches that could be executed on the 1PC fast-path
from doing so if they contain transactions in any epoch other than their first
epoch. We saw in cockroachdb#40466 that this could lead to intent abandonment if the txn
had written intents in earlier epochs.

I believe that this is a new issue that was introduced when we moved from using
the presence of a BeginTxn and EndTxn in the same batch to detect 1PC transactions
to using the sequence numbers of writes and the presence of an EndTxn in a batch
to detect 1PC transactions. That change was necessary for parallel commits. Before
that change, I think logic in DistSender was preventing us from hitting this case
because it would always split EndTxn requests from the rest of its batch when a
txn had been restarted:
 https://github.com/cockroachdb/cockroach/blob/18bdfe1a691fa6d785d510e86d27cecdac9c436e/pkg/kv/dist_sender.go#L692-L697

Release note: None
This change correctly sets sequence numbers in TestIsOnePhaseCommit
so that we can test the detection of one-phase commit transactions
whose batches do not contain BeginTransaction requests.

Release note: None
@nvb nvb requested a review from tbg September 5, 2019 16:32
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

nvb added 2 commits September 5, 2019 21:00
As of cockroachdb#35284, this was already checked by IsEndTransactionTriggeringRetryError.

Release note: None
We can then use this to centralize its logic and avoid leaking it
around 1PC transaction handling.

Release note: None
Copy link
Copy Markdown
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r1, 3 of 3 files at r2, 1 of 1 files at r3, 2 of 2 files at r4, 3 of 3 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@nvb
Copy link
Copy Markdown
Contributor Author

nvb commented Sep 6, 2019

bors r+

craig bot pushed a commit that referenced this pull request Sep 6, 2019
40518: storage: don't perform one-phase commit transactions after restarts r=nvanbenschoten a=nvanbenschoten

Fixes #40466.

This commit adjusts the Replica-side logic around executing one-phase commit transactions. It prevents batches that could be executed on the 1PC fast-path from doing so if they contain transactions in any epoch other than their first epoch. We saw in #40466 that this could lead to intent abandonment if the txn had written intents in earlier epochs.

I believe that this is a new issue that was introduced when we moved from using the presence of a BeginTxn and EndTxn in the same batch to detect 1PC transactions to using the sequence numbers of writes and the presence of an EndTxn in a batch to detect 1PC transactions. That change was necessary for parallel commits. Before that change, I think logic in DistSender was preventing us from hitting this case because it would always split EndTxn requests from the rest of its batch when a txn had been restarted: https://github.com/cockroachdb/cockroach/blob/18bdfe1a691fa6d785d510e86d27cecdac9c436e/pkg/kv/dist_sender.go#L692-L697

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig
Copy link
Copy Markdown
Contributor

craig bot commented Sep 6, 2019

Build succeeded

@craig craig bot merged commit b738ed2 into cockroachdb:master Sep 6, 2019
@nvb nvb deleted the nvanbenschoten/1pcEpoch branch September 9, 2019 19:21
nvb added a commit to nvb/cockroach that referenced this pull request Sep 26, 2019
Found while verifying that cockroachdb#40518 didn't negatively impact YCSB performance.
As it turns out, most txns that restart hit the check in DistSender instead,
and that ends up being critical for forward progress. Without it, transactions
appear to starve because they never write intents.

The commit doesn't make any changes there, but it does fix an issue where
1PC transactions were being incorrectly detected. The detection was ignoring
the fact that a 1PC attempt could be rejected by DistSender or a Replica.
We now tie the metric to whether an EndTransaction actually evaluated as
a 1PC txn instead of tying it to whether the EndTransaction _wanted_ to
be evaluated as a 1PC txn.

Release justification: low risk and improves metric reporting

Release note: None
craig bot pushed a commit that referenced this pull request Sep 26, 2019
41098: roachtest/jepsen: don't fail test if retrieving invoke.log fails r=nvanbenschoten a=nvanbenschoten

Fixes #41062.

Release justification: Testing only.

Release note: None

41104: kv: check for 1-phase commit after request, not before r=nvanbenschoten a=nvanbenschoten

Found while verifying that #40518 didn't negatively impact YCSB performance.
As it turns out, most txns that restart hit the check in DistSender instead,
and that ends up being critical for forward progress. Without it, transactions
appear to starve because they never write intents.

The commit doesn't make any changes there, but it does fix an issue where
1PC transactions were being incorrectly detected. The detection was ignoring
the fact that a 1PC attempt could be rejected by DistSender or a Replica.
We now tie the metric to whether an EndTransaction actually evaluated as
a 1PC txn instead of tying it to whether the EndTransaction _wanted_ to
be evaluated as a 1PC txn.

Release justification: low risk and improves metric reporting

Release note: None

41127: Add myself to AUTHORS r=miretskiy a=miretskiy

Authors += Myself

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

teamcity: failed test: TestTxnCoordSenderRetries

3 participants