mvcc: don't write below provisional commit timestamp on WriteTooOld errors#35086
Conversation
tbg
left a comment
There was a problem hiding this comment.
Great that you got to the bottom of this so fast! Man, that method is hard to follow. I gave this what I hope is a thorough review but will give it another one when you've figured out the lost update.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @danhhz, @nvanbenschoten, and @tbg)
pkg/storage/engine/mvcc.go, line 1466 at r1 (raw file):
} } else if !metaTimestamp.Less(readTimestamp) { // This is the case where we're trying to write under a committed
Not necessarily, right? So far we only know that there's a committed write above the read timestamp (but the write timestamp could be above the committed value). Nvm, you mull on that below.
pkg/storage/engine/mvcc.go, line 1483 at r1 (raw file):
// we'll need to refresh anyway, right? It might also be interacting // poorly with our logic in `evaluateBatch`. Any idea @bdarnell or // @tbg?
For snapshot transactions, this is just lost update -- two txn start at origts=100, their commit ts advances to 150 and 200 respectively, and then txn1 writes key=value1 (and commits) and txn2 cputs key from nil to value2, in effect wiping out value1 without ever having to observe it. I also don't see how this applies to serializable txns, though. As you mention, if we lay down an intent we're going to have to refresh anyway, and that should see the update that would be lost and force a restart. Worth getting to the bottom of even though I'm not sure avoiding the WTO is a good idea here. After all, what are the odds the write will survive the refresh? At that point we might as well check right here whether it would survive a refresh, though that seems to be optimizing for something we don't know is worth the added complexity.
PS this method probably has a cyclomatic complexity of MaxUint64... not that I see any easy way of changing that.
pkg/storage/engine/mvcc.go, line 1484 at r1 (raw file):
// poorly with our logic in `evaluateBatch`. Any idea @bdarnell or // @tbg? writeTimestamp.Forward(metaTimestamp.Next())
Does this writeTimestamp ever make it back to the txn in the response?
pkg/storage/engine/mvcc.go, line 1497 at r1 (raw file):
} else { // Outside of a transaction, read the latest value and advance // the write timestamp to the latest value's timestamp + 1. The
This part of the comment seems out of place because this was already done (also before this PR).
pkg/storage/engine/mvcc.go, line 1498 at r1 (raw file):
// Outside of a transaction, read the latest value and advance // the write timestamp to the latest value's timestamp + 1. The // new timestamp is returned to the caller in maybeTooOldErr.
Confused by this. So we do actually write but then also return maybeTooOldErr? I doubt that.
ab8ccbf to
5e2f362
Compare
nvb
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @danhhz and @tbg)
pkg/storage/engine/mvcc.go, line 1466 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Not necessarily, right? So far we only know that there's a committed write above the read timestamp (but the write timestamp could be above the committed value). Nvm, you mull on that below.
Ack.
pkg/storage/engine/mvcc.go, line 1483 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
For snapshot transactions, this is just lost update -- two txn start at origts=100, their commit ts advances to 150 and 200 respectively, and then txn1 writes key=value1 (and commits) and txn2 cputs key from nil to value2, in effect wiping out value1 without ever having to observe it. I also don't see how this applies to serializable txns, though. As you mention, if we lay down an intent we're going to have to refresh anyway, and that should see the update that would be lost and force a restart. Worth getting to the bottom of even though I'm not sure avoiding the WTO is a good idea here. After all, what are the odds the write will survive the refresh? At that point we might as well check right here whether it would survive a refresh, though that seems to be optimizing for something we don't know is worth the added complexity.
PS this method probably has a cyclomatic complexity of MaxUint64... not that I see any easy way of changing that.
Mystery solved!
pkg/storage/engine/mvcc.go, line 1484 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Does this writeTimestamp ever make it back to the txn in the response?
Yes, the writeTimestamp makes it back into the txn response here:
cockroach/pkg/storage/replica_evaluate.go
Line 288 in df358ac
pkg/storage/engine/mvcc.go, line 1497 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This part of the comment seems out of place because this was already done (also before this PR).
Updated.
pkg/storage/engine/mvcc.go, line 1498 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Confused by this. So we do actually write but then also return maybeTooOldErr? I doubt that.
It's surprising, but yes we do. Dating all the way back to 4220d4a!
tbg
left a comment
There was a problem hiding this comment.
Reviewed 1 of 2 files at r1, 1 of 1 files at r2.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @danhhz and @nvanbenschoten)
pkg/storage/engine/mvcc.go, line 1483 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Mystery solved!
How does this mechanism work exactly? If a CPut runs into a WTOErr, it "locally" (store?) retries the batch at the suggested timestamp and then returns the txn retry error? How does your attempted change break that? Shouldn't we be removing the WTO, but still returning a txn that needs to refresh if it wants to have any chance of committing? Why does it get to commit anyway?
pkg/storage/engine/mvcc.go, line 1498 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
It's surprising, but yes we do. Dating all the way back to 4220d4a!
But we're not in a transaction in this branch? The write must not be committed, so why do it?
This is probably not the right PR to be pulling on this thread, but add a TODO or some explanation, if applicable.
…rrors Fixes cockroachdb#34853. This change fixes cockroachdb#34853, which appears to have been broken by cockroachdb@57d0201#diff-41e9e1b7829f8e25daeb564f42c26017L1514. That change made it so that intents are now written at a transaction's provisional commit timestamp instead of its original timestamp (which is fantastic!). In doing so, it removed special purpose code to forward the `MVCCLogicalOpDetails` timestamp to the provisional commit timestamp of the transaction because it assumed that the new `writeTimestamp` variable that the change introduced already assumed this role. Unfortunately, the change allowed this `writeTimestamp` variable to regress below the transaction's provisional commit timestamp in subtle cases where a committed value ended up between the transaction's original timestamp and its provisional commit timestamp (a case that generates `WriteTooOldError`s). In this case, we were replacing the `writeTimestamp` with the committed value's timestamp + 1 logical tick. This regression allowed `MVCCWriteIntentOp`s to look like they were ignoring closed timestamps. The fix was to either introduce the special purpose code again or to avoid letting `writeTimestamp` regress in this `WriteTooOldError` code path. This PR chose the latter solution. In doing so, this PR also improves upon the benefit already provided by cockroachdb#32487 - it reduces the number of intents that need to be moved on transaction commit. This is because it ensures that if a transaction has an original (read) timestamp below a committed value and a provisional commit (write) timestamp above a committed value, an intent left by that transaction will be in a "committable" state without a need to rewrite it during intent resolution. Release note: None
5e2f362 to
6b49279
Compare
nvb
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @danhhz and @tbg)
pkg/storage/engine/mvcc.go, line 1483 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
How does this mechanism work exactly? If a CPut runs into a WTOErr, it "locally" (store?) retries the batch at the suggested timestamp and then returns the txn retry error? How does your attempted change break that? Shouldn't we be removing the WTO, but still returning a txn that needs to refresh if it wants to have any chance of committing? Why does it get to commit anyway?
The mechanism is all in evaluateBatch. The idea isn't to locally retry anything. Instead, that function's purpose is to run all commands in a batch even if they are throwing WTO errors to determine the highest timestamp needed for later retries.
The function then makes a few decision. First, for non-transactional requests, it always returns the largest WriteTooOldError. So to answer the question below, we will still return the error instead of committing, but we will wait until the entire non-transactional batch has finished evaluating so that we can return the "best" error.
For transactional requests, things are more interesting. For suitable request types like blind Puts, it sets the batch responses timestamp and its WriteTooOld flag. It can then go ahead and writes these intents. However, for others that are not blind like CPut, it returns the WriteTooOld error immediately instead of writing the intents. This allows these request types to avoid ever needing to be refreshed. This is also what broke when I changed this to rely on CPuts refreshing.
I added some more commentary in the code.
pkg/storage/engine/mvcc.go, line 1498 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
But we're not in a transaction in this branch? The write must not be committed, so why do it?
This is probably not the right PR to be pulling on this thread, but add a TODO or some explanation, if applicable.
Answered above.
tbg
left a comment
There was a problem hiding this comment.
Reviewed 1 of 1 files at r3.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @danhhz)
|
bors r=tbg |
35081: sql: Fix test flake r=andy-kimball a=andy-kimball Filter out PushTxn/ResolveIntent trace messages, since they represent intent cleanup from earlier operations that interferes with subsequent tests. This was already being done in other variations, so this commit extends that to all the variations that are doing filtering of messages. Fixes #34911 Release note: None 35086: mvcc: don't write below provisional commit timestamp on WriteTooOld errors r=tbg a=nvanbenschoten Fixes #34853. This change fixes #34853, which appears to have been broken by 57d0201#diff-41e9e1b7829f8e25daeb564f42c26017L1514. That change made it so that intents are now written at a transaction's provisional commit timestamp instead of its original timestamp (which is fantastic!). In doing so, it removed special-purpose code to forward the `MVCCLogicalOpDetails` timestamp to the provisional commit timestamp of the transaction because it assumed that the new `writeTimestamp` variable that the change introduced already assumed this role. Unfortunately, the change allowed this `writeTimestamp` variable to regress below the transaction's provisional commit timestamp in subtle cases where a committed value ended up between the transaction's original timestamp and its provisional commit timestamp (a case that generates `WriteTooOldError`s). In this case, we were replacing the `writeTimestamp` with the committed value's timestamp + 1 logical tick. This regression allowed `MVCCWriteIntentOp`s to look like they were ignoring closed timestamps. The fix was to either introduce the special purpose code again or to avoid letting `writeTimestamp` regress in this `WriteTooOldError` code path. This PR chose the latter solution. In doing so, this PR also improves upon the benefit already provided by #32487 - it reduces the number of intents that need to be moved on transaction commit. This is because it ensures that if a transaction has an original (read) timestamp below a committed value and a provisional commit (write) timestamp above a committed value, an intent left by that transaction will be in a "committable" state without a need to rewrite it during intent resolution. Release note: None 35112: kv: assert an absence of retries in TestTxnCoordSenderRetries r=tbg,andreimatei a=nvanbenschoten The test was only asserting that cases that should observe retries did, not the cases that should not observe retries did not. The PR also fixes `TestPropagateTxnOnError`, which wasn't really testing anything. We now set txn.Writing on the client, so it wasn't verifying what it thought it was. We switch it to assert that observed timestamps are propagated on error instead. 35118: storage: Downgrade a spammy log message r=tbg a=bdarnell This log message only occurs on clusters with multiple stores per node, but on those clusters it fires multiple times per second. Release note: None Co-authored-by: Andrew Kimball <andyk@cockroachlabs.com> Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com> Co-authored-by: Ben Darnell <ben@bendarnell.com>
Build succeeded |
Fixes #34853.
This change fixes #34853, which appears to have been broken by
57d0201#diff-41e9e1b7829f8e25daeb564f42c26017L1514.
That change made it so that intents are now written at a transaction's
provisional commit timestamp instead of its original timestamp (which is
fantastic!). In doing so, it removed special-purpose code to forward the
MVCCLogicalOpDetailstimestamp to the provisional commit timestamp of thetransaction because it assumed that the new
writeTimestampvariable that thechange introduced already assumed this role.
Unfortunately, the change allowed this
writeTimestampvariable to regressbelow the transaction's provisional commit timestamp in subtle cases where a
committed value ended up between the transaction's original timestamp and its
provisional commit timestamp (a case that generates
WriteTooOldErrors). Inthis case, we were replacing the
writeTimestampwith the committed value'stimestamp + 1 logical tick. This regression allowed
MVCCWriteIntentOps to looklike they were ignoring closed timestamps.
The fix was to either introduce the special purpose code again or to avoid
letting
writeTimestampregress in thisWriteTooOldErrorcode path. This PRchose the latter solution.
In doing so, this PR also improves upon the benefit already provided by #32487 -
it reduces the number of intents that need to be moved on transaction commit.
This is because it ensures that if a transaction has an original (read)
timestamp below a committed value and a provisional commit (write) timestamp
above a committed value, an intent left by that transaction will be in a
"committable" state without a need to rewrite it during intent resolution.
Release note: None