Skip to content

mvcc: don't write below provisional commit timestamp on WriteTooOld errors#35086

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
nvb:nvanbenschoten/writeTooOldTs
Feb 21, 2019
Merged

mvcc: don't write below provisional commit timestamp on WriteTooOld errors#35086
craig[bot] merged 1 commit intocockroachdb:masterfrom
nvb:nvanbenschoten/writeTooOldTs

Conversation

@nvb
Copy link
Copy Markdown
Contributor

@nvb nvb commented Feb 20, 2019

Fixes #34853.

This change fixes #34853, which appears to have been broken by
57d0201#diff-41e9e1b7829f8e25daeb564f42c26017L1514.
That change made it so that intents are now written at a transaction's
provisional commit timestamp instead of its original timestamp (which is
fantastic!). In doing so, it removed special-purpose code to forward the
MVCCLogicalOpDetails timestamp to the provisional commit timestamp of the
transaction because it assumed that the new writeTimestamp variable that the
change introduced already assumed this role.

Unfortunately, the change allowed this writeTimestamp variable to regress
below the transaction's provisional commit timestamp in subtle cases where a
committed value ended up between the transaction's original timestamp and its
provisional commit timestamp (a case that generates WriteTooOldErrors). In
this case, we were replacing the writeTimestamp with the committed value's
timestamp + 1 logical tick. This regression allowed MVCCWriteIntentOps to look
like they were ignoring closed timestamps.

The fix was to either introduce the special purpose code again or to avoid
letting writeTimestamp regress in this WriteTooOldError code path. This PR
chose the latter solution.

In doing so, this PR also improves upon the benefit already provided by #32487 -
it reduces the number of intents that need to be moved on transaction commit.
This is because it ensures that if a transaction has an original (read)
timestamp below a committed value and a provisional commit (write) timestamp
above a committed value, an intent left by that transaction will be in a
"committable" state without a need to rewrite it during intent resolution.

Release note: None

@nvb nvb requested review from a team, danhhz and tbg February 20, 2019 06:12
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great that you got to the bottom of this so fast! Man, that method is hard to follow. I gave this what I hope is a thorough review but will give it another one when you've figured out the lost update.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @danhhz, @nvanbenschoten, and @tbg)


pkg/storage/engine/mvcc.go, line 1466 at r1 (raw file):

			}
		} else if !metaTimestamp.Less(readTimestamp) {
			// This is the case where we're trying to write under a committed

Not necessarily, right? So far we only know that there's a committed write above the read timestamp (but the write timestamp could be above the committed value). Nvm, you mull on that below.


pkg/storage/engine/mvcc.go, line 1483 at r1 (raw file):

			// we'll need to refresh anyway, right? It might also be interacting
			// poorly with our logic in `evaluateBatch`. Any idea @bdarnell or
			// @tbg?

For snapshot transactions, this is just lost update -- two txn start at origts=100, their commit ts advances to 150 and 200 respectively, and then txn1 writes key=value1 (and commits) and txn2 cputs key from nil to value2, in effect wiping out value1 without ever having to observe it. I also don't see how this applies to serializable txns, though. As you mention, if we lay down an intent we're going to have to refresh anyway, and that should see the update that would be lost and force a restart. Worth getting to the bottom of even though I'm not sure avoiding the WTO is a good idea here. After all, what are the odds the write will survive the refresh? At that point we might as well check right here whether it would survive a refresh, though that seems to be optimizing for something we don't know is worth the added complexity.

PS this method probably has a cyclomatic complexity of MaxUint64... not that I see any easy way of changing that.


pkg/storage/engine/mvcc.go, line 1484 at r1 (raw file):

			// poorly with our logic in `evaluateBatch`. Any idea @bdarnell or
			// @tbg?
			writeTimestamp.Forward(metaTimestamp.Next())

Does this writeTimestamp ever make it back to the txn in the response?


pkg/storage/engine/mvcc.go, line 1497 at r1 (raw file):

			} else {
				// Outside of a transaction, read the latest value and advance
				// the write timestamp to the latest value's timestamp + 1. The

This part of the comment seems out of place because this was already done (also before this PR).


pkg/storage/engine/mvcc.go, line 1498 at r1 (raw file):

				// Outside of a transaction, read the latest value and advance
				// the write timestamp to the latest value's timestamp + 1. The
				// new timestamp is returned to the caller in maybeTooOldErr.

Confused by this. So we do actually write but then also return maybeTooOldErr? I doubt that.

@nvb nvb force-pushed the nvanbenschoten/writeTooOldTs branch from ab8ccbf to 5e2f362 Compare February 20, 2019 23:42
Copy link
Copy Markdown
Contributor Author

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @danhhz and @tbg)


pkg/storage/engine/mvcc.go, line 1466 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Not necessarily, right? So far we only know that there's a committed write above the read timestamp (but the write timestamp could be above the committed value). Nvm, you mull on that below.

Ack.


pkg/storage/engine/mvcc.go, line 1483 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

For snapshot transactions, this is just lost update -- two txn start at origts=100, their commit ts advances to 150 and 200 respectively, and then txn1 writes key=value1 (and commits) and txn2 cputs key from nil to value2, in effect wiping out value1 without ever having to observe it. I also don't see how this applies to serializable txns, though. As you mention, if we lay down an intent we're going to have to refresh anyway, and that should see the update that would be lost and force a restart. Worth getting to the bottom of even though I'm not sure avoiding the WTO is a good idea here. After all, what are the odds the write will survive the refresh? At that point we might as well check right here whether it would survive a refresh, though that seems to be optimizing for something we don't know is worth the added complexity.

PS this method probably has a cyclomatic complexity of MaxUint64... not that I see any easy way of changing that.

Mystery solved!


pkg/storage/engine/mvcc.go, line 1484 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Does this writeTimestamp ever make it back to the txn in the response?

Yes, the writeTimestamp makes it back into the txn response here:

ba.Txn.Timestamp.Forward(tErr.ActualTimestamp)


pkg/storage/engine/mvcc.go, line 1497 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This part of the comment seems out of place because this was already done (also before this PR).

Updated.


pkg/storage/engine/mvcc.go, line 1498 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Confused by this. So we do actually write but then also return maybeTooOldErr? I doubt that.

It's surprising, but yes we do. Dating all the way back to 4220d4a!

Copy link
Copy Markdown
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 2 files at r1, 1 of 1 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @danhhz and @nvanbenschoten)


pkg/storage/engine/mvcc.go, line 1483 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Mystery solved!

How does this mechanism work exactly? If a CPut runs into a WTOErr, it "locally" (store?) retries the batch at the suggested timestamp and then returns the txn retry error? How does your attempted change break that? Shouldn't we be removing the WTO, but still returning a txn that needs to refresh if it wants to have any chance of committing? Why does it get to commit anyway?


pkg/storage/engine/mvcc.go, line 1498 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It's surprising, but yes we do. Dating all the way back to 4220d4a!

But we're not in a transaction in this branch? The write must not be committed, so why do it?
This is probably not the right PR to be pulling on this thread, but add a TODO or some explanation, if applicable.

…rrors

Fixes cockroachdb#34853.

This change fixes cockroachdb#34853, which appears to have been broken by
cockroachdb@57d0201#diff-41e9e1b7829f8e25daeb564f42c26017L1514.
That change made it so that intents are now written at a transaction's
provisional commit timestamp instead of its original timestamp (which is
fantastic!). In doing so, it removed special purpose code to forward the
`MVCCLogicalOpDetails` timestamp to the provisional commit timestamp of the
transaction because it assumed that the new `writeTimestamp` variable that the
change introduced already assumed this role.

Unfortunately, the change allowed this `writeTimestamp` variable to regress
below the transaction's provisional commit timestamp in subtle cases where a
committed value ended up between the transaction's original timestamp and its
provisional commit timestamp (a case that generates `WriteTooOldError`s). In
this case, we were replacing the `writeTimestamp` with the committed value's
timestamp + 1 logical tick. This regression allowed `MVCCWriteIntentOp`s to look
like they were ignoring closed timestamps.

The fix was to either introduce the special purpose code again or to avoid
letting `writeTimestamp` regress in this `WriteTooOldError` code path. This PR
chose the latter solution.

In doing so, this PR also improves upon the benefit already provided by cockroachdb#32487 -
it reduces the number of intents that need to be moved on transaction commit.
This is because it ensures that if a transaction has an original (read)
timestamp below a committed value and a provisional commit (write) timestamp
above a committed value, an intent left by that transaction will be in a
"committable" state without a need to rewrite it during intent resolution.

Release note: None
@nvb nvb force-pushed the nvanbenschoten/writeTooOldTs branch from 5e2f362 to 6b49279 Compare February 21, 2019 15:27
Copy link
Copy Markdown
Contributor Author

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @danhhz and @tbg)


pkg/storage/engine/mvcc.go, line 1483 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

How does this mechanism work exactly? If a CPut runs into a WTOErr, it "locally" (store?) retries the batch at the suggested timestamp and then returns the txn retry error? How does your attempted change break that? Shouldn't we be removing the WTO, but still returning a txn that needs to refresh if it wants to have any chance of committing? Why does it get to commit anyway?

The mechanism is all in evaluateBatch. The idea isn't to locally retry anything. Instead, that function's purpose is to run all commands in a batch even if they are throwing WTO errors to determine the highest timestamp needed for later retries.

The function then makes a few decision. First, for non-transactional requests, it always returns the largest WriteTooOldError. So to answer the question below, we will still return the error instead of committing, but we will wait until the entire non-transactional batch has finished evaluating so that we can return the "best" error.

For transactional requests, things are more interesting. For suitable request types like blind Puts, it sets the batch responses timestamp and its WriteTooOld flag. It can then go ahead and writes these intents. However, for others that are not blind like CPut, it returns the WriteTooOld error immediately instead of writing the intents. This allows these request types to avoid ever needing to be refreshed. This is also what broke when I changed this to rely on CPuts refreshing.

I added some more commentary in the code.


pkg/storage/engine/mvcc.go, line 1498 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

But we're not in a transaction in this branch? The write must not be committed, so why do it?
This is probably not the right PR to be pulling on this thread, but add a TODO or some explanation, if applicable.

Answered above.

Copy link
Copy Markdown
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @danhhz)

@nvb
Copy link
Copy Markdown
Contributor Author

nvb commented Feb 21, 2019

bors r=tbg

craig bot pushed a commit that referenced this pull request Feb 21, 2019
35081: sql: Fix test flake r=andy-kimball a=andy-kimball

Filter out PushTxn/ResolveIntent trace messages, since they represent intent
cleanup from earlier operations that interferes with subsequent tests. This
was already being done in other variations, so this commit extends that to
all the variations that are doing filtering of messages.

Fixes #34911

Release note: None

35086: mvcc: don't write below provisional commit timestamp on WriteTooOld errors r=tbg a=nvanbenschoten

Fixes #34853.

This change fixes #34853, which appears to have been broken by
57d0201#diff-41e9e1b7829f8e25daeb564f42c26017L1514.
That change made it so that intents are now written at a transaction's
provisional commit timestamp instead of its original timestamp (which is
fantastic!). In doing so, it removed special-purpose code to forward the
`MVCCLogicalOpDetails` timestamp to the provisional commit timestamp of the
transaction because it assumed that the new `writeTimestamp` variable that the
change introduced already assumed this role.

Unfortunately, the change allowed this `writeTimestamp` variable to regress
below the transaction's provisional commit timestamp in subtle cases where a
committed value ended up between the transaction's original timestamp and its
provisional commit timestamp (a case that generates `WriteTooOldError`s). In
this case, we were replacing the `writeTimestamp` with the committed value's
timestamp + 1 logical tick. This regression allowed `MVCCWriteIntentOp`s to look
like they were ignoring closed timestamps.

The fix was to either introduce the special purpose code again or to avoid
letting `writeTimestamp` regress in this `WriteTooOldError` code path. This PR
chose the latter solution.

In doing so, this PR also improves upon the benefit already provided by #32487 -
it reduces the number of intents that need to be moved on transaction commit.
This is because it ensures that if a transaction has an original (read)
timestamp below a committed value and a provisional commit (write) timestamp
above a committed value, an intent left by that transaction will be in a
"committable" state without a need to rewrite it during intent resolution.

Release note: None

35112: kv: assert an absence of retries in TestTxnCoordSenderRetries  r=tbg,andreimatei a=nvanbenschoten

The test was only asserting that cases that should observe retries did, not the cases that should not observe retries did not.

The PR also fixes `TestPropagateTxnOnError`, which wasn't really testing anything. We now set txn.Writing on the client, so it wasn't verifying what it thought it was. We switch it to assert that observed timestamps are propagated on error instead.

35118: storage: Downgrade a spammy log message r=tbg a=bdarnell

This log message only occurs on clusters with multiple stores per
node, but on those clusters it fires multiple times per second.

Release note: None

Co-authored-by: Andrew Kimball <andyk@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Ben Darnell <ben@bendarnell.com>
@craig
Copy link
Copy Markdown
Contributor

craig bot commented Feb 21, 2019

Build succeeded

@craig craig bot merged commit 6b49279 into cockroachdb:master Feb 21, 2019
@nvb nvb deleted the nvanbenschoten/writeTooOldTs branch February 21, 2019 18:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

roachtest: cdc/tpcc-1000/rangefeed=true failed

3 participants