kv: perform PushTxn(PUSH_TIMESTAMP) without Raft consensus#95911
kv: perform PushTxn(PUSH_TIMESTAMP) without Raft consensus#95911craig[bot] merged 3 commits intocockroachdb:masterfrom
Conversation
ea417d0 to
9128ec7
Compare
arulajmani
left a comment
There was a problem hiding this comment.
Reviewed 9 of 9 files at r1, 3 of 3 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)
-- commits line 6 at r1:
nit: "back to"
pkg/kv/kvserver/replica_tscache.go line 410 at r1 (raw file):
// PushTxn(ABORT). As a result of this mechanism, a pusher transaction never // needs to create the transaction record for contending transactions that // it forms write-write conflicts with.
The phrasing here leaves the handling of write-read conflicts ambiguous. Consider adding a reference to MinTxnCommitTS here, so that readers know where to look if they're curious about Txn record creation semantics when there's a write-read conflict.
pkg/kv/kvserver/replica_tscache.go line 449 at r1 (raw file):
// v | v EndTxn(!STAGING) v v | // +--------------------+ or PushTxn(ABORT) +--------------------+ // | | then: txn.ts -> v1 | |
txn.ts -> v1 should only happens when EndTxn(!Staging), right? It doesn't make sense in the PushTxn(ABORT) case -- should we split these cases out now?
This commit moves the point when a transaction checks the timestamp cache for its minimum commit timestamp from transaction record creation time back to commit time. This allows us to use the timestamp cache to communicate successful PushTxn(TIMESTAMP) to a pushee with an existing record without re-writing its transaction record. See the changes to the state machine diagram attached to `Replica.CanCreateTxnRecord` for a visual depiction of this change.
This commit simplifies logic in PushTxnRequest that promoted a PUSH_TIMESTAMP to a PUSH_ABORT when it found a STAGING transaction record which it knew to be part of a failed parallel commit attempt. The logic was trying to be smart and minimize the cases where it needed to promote a PUSH_TIMESTAMP to a PUSH_ABORT. It was avoiding doing so if it had previously found an intent with a higher epoch. In practice, this optimization doesn't seem to matter. It was also making logic in a following commit harder to write because it was preserving cases where a PUSH_TIMESTAMP would succeed against a STAGING transaction record. We don't want to support such state transitions, so eliminate them.
Fixes cockroachdb#94728. With the previous commit, transactions will check the timestamp cache before committing to determine whether they have had their commit timestamp pushed. This commit exploits this to avoid ever rewriting a transaction's record on a timestamp push. Instead, the timestamp cache is used, regardless of whether the record already existed or not. Doing so avoids consensus.
9128ec7 to
3dbb321
Compare
nvb
left a comment
There was a problem hiding this comment.
TFTR!
bors r=arulajmani
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani)
Previously, arulajmani (Arul Ajmani) wrote…
nit: "back to"
Done.
pkg/kv/kvserver/replica_tscache.go line 410 at r1 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
The phrasing here leaves the handling of write-read conflicts ambiguous. Consider adding a reference to
MinTxnCommitTShere, so that readers know where to look if they're curious about Txn record creation semantics when there's a write-read conflict.
Good idea, done.
pkg/kv/kvserver/replica_tscache.go line 449 at r1 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
txn.ts -> v1should only happens whenEndTxn(!Staging), right? It doesn't make sense in thePushTxn(ABORT)case -- should we split these cases out now?
Yeah, you're correct about that, good catch. We could forward the record's WriteTimestamp to MinTxnCommitTS() when aborting the txn on a PushTxn(ABORT), but we don't because the write timestamp is meaningless once the txn is aborted. I'm going to leave this as is out of a desire not to make this diagram any more crowded and harder to read than it already is while acknowledging that it is not entirely accurate in this specific case.
|
Build failed (retrying...): |
|
Build failed (retrying...): |
|
Build succeeded: |
This PR contains a sequence of three commits that combine to resolve #94728.
check txn push marker on commit, not txn record creation
The first commit moves the point when a transaction checks the timestamp cache for its minimum commit timestamp from transaction record creation time back to commit time. This allows us to use the timestamp cache to communicate successful
PushTxn(TIMESTAMP)to a pushee with an existing record without rewriting its transaction record.For details, see the changes to the state machine diagram attached to
Replica.CanCreateTxnRecordfor a visual depiction of this change.always promote PUSH_TIMESTAMP to PUSH_ABORT on failed staging record
The second commit simplifies logic in PushTxnRequest that promoted a
PUSH_TIMESTAMPto aPUSH_ABORTwhen itfound a STAGING transaction record that it knew to be part of a failed parallel commit attempt.
The logic tried to be smart and minimize the cases where it needed to promote a
PUSH_TIMESTAMPto aPUSH_ABORT. It was avoiding doing so if it had previously found an intent with a higher epoch. In practice, this optimization doesn't seem to matter. It was also making logic in a following commit harder to write because it was preserving cases where aPUSH_TIMESTAMPwould succeed against a STAGING transaction record. We don't want to support such state transitions, so eliminate them.don't rewrite txn record on PushTxn(TIMESTAMP)
With the previous two commits, transactions will check the timestamp cache before committing to determine whether they have had their commit timestamp pushed. The final commit exploits this to avoid ever rewriting a transaction's record on a timestamp push. Instead, the timestamp cache is used, regardless of whether the record already existed or not. Doing so avoids consensus.
Release note: None