kv: implement commit-wait for future time transaction commits#61110
kv: implement commit-wait for future time transaction commits#61110craig[bot] merged 2 commits intocockroachdb:masterfrom
Conversation
a722798 to
33ec88b
Compare
andreimatei
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 1388 at r2 (raw file):
require.Empty(t, errC) adv := futureOffset / 5
I don't really understand what's going on here
Generally I think some words about how the test is going to test the sleep duration would help
pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 1402 at r2 (raw file):
} require.True(t, minClockTS.Less(s.Clock.Now())) } else {
nit: put the else branch first and return early
pkg/kv/kvclient/kvcoord/txn_metrics.go, line 26 at r2 (raw file):
Commits1PC *metric.Counter // Commits which finished in a single phase ParallelCommits *metric.Counter // Commits which entered the STAGING state CommitWaits *metric.Counter // Commits that waited for linearizability
"single-key linearizability" :)
pkg/kv/kvclient/kvcoord/txn_metrics.go, line 77 at r2 (raw file):
metaCommitWaitRates = metric.Metadata{ Name: "txn.commit_waits", Help: "Number of KV transactions that had to commit-wait on commit",
perhaps "Had to sleep on commit in order to ensure serializability. This generally happens to transactions writing to global ranges.".
pkg/util/hlc/hlc.go, line 490 at r2 (raw file):
} // SleepUntil sleeps until the given time.
consider adding more words to the comment emphasizing the all the caclulations are done in HLC time, not physical time
This was missed in a85115a. We were assuming that all calls to `commitReadOnlyTxnLocked` were for `EndTxn` requests with the Commit flag set to true, but this is not the case. This was not only confusing, but it was also leading to the `txn.commit` metric being incremented on rollback of a read-only transaction, instead of the `txn.aborts` metric. Release justification: bug fix
Fixes cockroachdb#57687. Related to cockroachdb#52745. This commit introduces a "commit-wait" sleep stage after a transaction commits, which is entered if doing so is deemed necessary for consistency. By default, commit-wait is only necessary for transactions that commit with a future-time timestamp that leads the local HLC clock. This is because CockroachDB's consistency model depends on all transactions waiting until their commit timestamp is below their gateway clock. In doing so, transactions ensure that at the time that they complete, all other clocks in the system (i.e. on all possible gateways) will be no more than the max_offset below the transaction's commit timestamp. This property ensures that all causally dependent transactions will have an uncertainty interval (see GlobalUncertaintyLimit) that exceeds the original transaction's commit timestamp, preventing stale reads. Without the wait, it would be possible for a read-write transaction to write a future-time value and then for a causally dependent transaction to read below that future-time value, violating "read your writes". The property must also hold for read-only transactions, which may have a commit timestamp in the future due to an uncertainty restart after observing a future-time value in their uncertainty interval. In such cases, the property that the transaction must wait for the local HLC clock to exceed its commit timestamp is not necessary to prevent stale reads, but it is necessary to ensure monotonic reads. Without the wait, it would be possible for a read-only transaction coordinated on a gateway with a fast clock to return a future-time value and then for a causally dependent read-only transaction coordinated on a gateway with a slow clock to read below that future-time value, violating "monotonic reads". In practice, most transactions do not need to wait at all, because their commit timestamps were pulled from an HLC clock (i.e. are not synthetic) and so they will be guaranteed to lead the local HLC's clock, assuming proper HLC time propagation. Only transactions whose commit timestamps were pushed into the future will need to wait, like those who wrote to a global_read range and got bumped by the closed timestamp or those who conflicted (write-read or write-write) with an existing future-time value. However, CockroachDB also supports a stricter model of consistency through its "linearizable" flag. When in linearizable mode (also known as "strict serializable" mode), all writing transactions (but not read-only transactions) must wait an additional max_offset after committing to ensure that their commit timestamp is below the current HLC clock time of any other node in the system. In doing so, all causally dependent transactions are guaranteed to start with higher timestamps, regardless of the gateway they use. This ensures that all causally dependent transactions commit with higher timestamps, even if their read and writes sets do not conflict with the original transaction's. This obviates the need for uncertainty intervals and prevents the "causal reverse" anamoly which can be observed by a third, concurrent transaction. For more, see https://www.cockroachlabs.com/blog/consistency-model/ and docs/RFCS/20200811_non_blocking_txns.md. Release notes: None Release justification: New functionality.
33ec88b to
7ec4212
Compare
nvb
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei)
pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 1388 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I don't really understand what's going on here
Generally I think some words about how the test is going to test the sleep duration would help
Done.
pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 1402 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
nit: put the
elsebranch first and return early
Done.
pkg/kv/kvclient/kvcoord/txn_metrics.go, line 26 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
"single-key linearizability" :)
Not if the --linearizability flag is set!
pkg/kv/kvclient/kvcoord/txn_metrics.go, line 77 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
perhaps "Had to sleep on commit in order to ensure serializability. This generally happens to transactions writing to global ranges.".
Done.
pkg/util/hlc/hlc.go, line 490 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
consider adding more words to the comment emphasizing the all the caclulations are done in HLC time, not physical time
Done.
|
bors r+ |
|
Build succeeded: |
Fixes #57687.
Related to #52745.
This PR introduces a "commit-wait" sleep stage after a transaction commits, which is entered if doing so is deemed necessary for consistency.
By default, commit-wait is only necessary for transactions that commit with a future-time timestamp that leads the local HLC clock. This is because CockroachDB's consistency model depends on all transactions waiting until their commit timestamp is below their gateway clock. In doing so, transactions ensure that at the time that they complete, all other clocks in the system (i.e. on all possible gateways) will be no more than the max_offset below the transaction's commit timestamp. This property ensures that all causally dependent transactions will have an uncertainty interval (see GlobalUncertaintyLimit) that exceeds the original transaction's commit timestamp, preventing stale reads. Without the wait, it would be possible for a read-write transaction to write a future-time value and then for a causally dependent transaction to read below that future-time value, violating "read your writes".
The property must also hold for read-only transactions, which may have a commit timestamp in the future due to an uncertainty restart after observing a future-time value in their uncertainty interval. In such cases, the property that the transaction must wait for the local HLC clock to exceed its commit timestamp is not necessary to prevent stale reads, but it is necessary to ensure monotonic reads. Without the wait, it would be possible for a read-only transaction coordinated on a gateway with a fast clock to return a future-time value and then for a causally dependent read-only transaction coordinated on a gateway with a slow clock to read below that future-time value, violating "monotonic reads".
In practice, most transactions do not need to wait at all, because their commit timestamps were pulled from an HLC clock (i.e. are not synthetic) and so they will be guaranteed to lead the local HLC's clock, assuming proper HLC time propagation. Only transactions whose commit timestamps were pushed into the future will need to wait, like those who wrote to a global_read range and got bumped by the closed timestamp or those who conflicted (write-read or write-write) with an existing future-time value.
However, CockroachDB also supports a stricter model of consistency through its "linearizable" flag. When in linearizable mode (also known as "strict serializable" mode), all writing transactions (but not read-only transactions) must wait an additional max_offset after committing to ensure that their commit timestamp is below the current HLC clock time of any other node in the system. In doing so, all causally dependent transactions are guaranteed to start with higher timestamps, regardless of the gateway they use. This ensures that all causally dependent transactions commit with higher timestamps, even if their read and writes sets do not conflict with the original transaction's. This obviates the need for uncertainty intervals and prevents the "causal reverse" anamoly which can be observed by a third, concurrent transaction.
For more, see https://www.cockroachlabs.com/blog/consistency-model/ and docs/RFCS/20200811_non_blocking_txns.md.
The PR also fixes a bug by properly marking read-only txn as aborted on rollback, which was missed in a85115a.
We were assuming that all calls to
commitReadOnlyTxnLockedwere forEndTxnrequests with the Commit flag set to true, but this is not the case. This was not only confusing, but it was also leading to thetxn.commitmetric being incremented on rollback of a read-only transaction, instead of thetxn.abortsmetric.Release justification: New functionality.