
rfc: Read Committed Isolation#100608

Merged
craig[bot] merged 1 commit into cockroachdb:master from nvb:nvanbenschoten/readCommittedRFC
Nov 10, 2023

Conversation

@nvb
Contributor

@nvb nvb commented Apr 4, 2023

Read the full RFC here

Concurrency control comprises the mechanisms that a database system employs to guarantee the "correct" execution of concurrent operations. Isolation levels provide concurrency control with the requirements for correctness — defining how and when the changes made by one transaction become visible to other transactions.

Strong isolation levels provide a high degree of isolation between concurrent transactions. They limit or eliminate the forms of concurrency effects that transactions may observe.

Weak isolation levels are more permissive. They trade off isolation guarantees for improved performance. Transactions run under weaker isolation levels block less and encounter fewer aborts (retry errors). In some systems, they also perform less work.

This RFC proposes the implementation of the Read Committed isolation level in CockroachDB.

BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED

Read Committed is a common, relatively weak isolation level found in most legacy database systems; in fact, it is the default isolation level in PostgreSQL and most of those systems.

Critically for this proposal, it is the strongest isolation level present in PostgreSQL that does not experience serialization failure errors and require applications to handle these errors using application-side retry logic.

By providing users with the option to run transactions under Read Committed, we provide them with the option to avoid transaction retry logic in their applications and make migrations to CockroachDB easier.
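For context, the application-side retry logic that Serializable clients typically carry (and that Read Committed would let applications drop) looks roughly like the sketch below. The names `SerializationFailure` and `run_with_retries` are illustrative stand-ins, not CockroachDB or driver APIs:

```python
# Sketch of the client-side retry loop that Serializable transactions
# typically require. SerializationFailure stands in for SQLSTATE 40001;
# none of these names are actual CockroachDB or driver APIs.

class SerializationFailure(Exception):
    """Stand-in for SQLSTATE 40001 (serialization_failure)."""

def run_with_retries(txn_body, max_attempts=5):
    """Re-run txn_body until it commits without a serialization failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn_body(attempt)
        except SerializationFailure:
            if attempt == max_attempts:
                raise

# Simulated transaction body that fails twice before committing.
def flaky_txn(attempt):
    if attempt < 3:
        raise SerializationFailure()
    return "committed"

print(run_with_retries(flaky_txn))  # committed on the 3rd attempt
```

Under Read Committed, the goal is for this loop to become unnecessary because the database retries internally at statement granularity.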

Epic: CRDB-23092
Release note: None

@nvb nvb requested a review from a team as a code owner April 4, 2023 16:27

@nvb nvb force-pushed the nvanbenschoten/readCommittedRFC branch from fcb7fcd to 966e5a6 on April 4, 2023 16:29
@nvb nvb requested review from bdarnell and tbg April 6, 2023 13:56
Contributor

@bdarnell bdarnell left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @DrewKimball, @michae2, @nvanbenschoten, @rafiss, @rytaft, and @tbg)


docs/RFCS/20230122_read_committed_isolation.md line 46 at r1 (raw file):

applications and make migrations to CockroachDB easier.

# Motivation

Another motivating factor (that was non-obvious to me when we discussed it earlier) is that read committed can also prevent retry errors for reasons other than serializability, and specifically for clock uncertainty errors (just retry the statement at a higher timestamp). Applications vary in how often they see serializability errors, and just like with deadlock, being careful about data access patterns can help to minimize them. But clock uncertainty errors can strike randomly at any time (even on read-only txns), and for many apps this was their biggest source of retry-related pain. READ COMMITTED solves this problem even though it's not strictly related to the isolation level (note that SNAPSHOT isolation was still subject to clock uncertainty errors, which was one of the reasons we removed it instead of fixing its bugs - it wasn't good enough to avoid the need for retry loops).
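The per-statement retry described here can be sketched as: on a clock-uncertainty error, re-run only the current statement at a bumped read timestamp rather than surfacing a retryable error for the whole transaction. All names below are hypothetical, for illustration only:

```python
# Illustrative sketch of retrying a single statement at a higher read
# timestamp after a clock-uncertainty error, instead of aborting the
# whole transaction. All names are hypothetical.

class ReadWithinUncertaintyInterval(Exception):
    def __init__(self, observed_ts):
        self.observed_ts = observed_ts

def run_statement(stmt, read_ts):
    """Run one statement; fail if a write is seen above read_ts but
    within the uncertainty window."""
    uncertain_ts = stmt.get("uncertain_write_ts")
    if uncertain_ts is not None and uncertain_ts > read_ts:
        raise ReadWithinUncertaintyInterval(uncertain_ts)
    return f"ok@{read_ts}"

def run_statement_with_uncertainty_retries(stmt, read_ts):
    """With per-statement snapshots, the server can retry internally."""
    while True:
        try:
            return run_statement(stmt, read_ts)
        except ReadWithinUncertaintyInterval as err:
            # Bump the statement's read timestamp past the uncertain
            # value and retry; no error escapes to the application.
            read_ts = err.observed_ts
            stmt = dict(stmt, uncertain_write_ts=None)

result = run_statement_with_uncertainty_retries({"uncertain_write_ts": 105}, 100)
print(result)  # ok@105
```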


docs/RFCS/20230122_read_committed_isolation.md line 315 at r1 (raw file):

Foreign key existence checks are an example of such a system-level consistency
check. Under the hood, PostgreSQL runs a `SELECT FOR SHARE` query that [looks
like this](https://github.com/postgres/postgres/blob/75f49221c22286104f032827359783aa5f4e6646/src/backend/utils/adt/ri_triggers.c#L363).

Thanks for tracking this down! I knew it must be something like that but was never sure what (nit: it's FOR KEY SHARE, not FOR SHARE, which further narrows the scope of the lock a bit).


docs/RFCS/20230122_read_committed_isolation.md line 493 at r1 (raw file):

hlc.MaxOffset()` and clearing all `ObservedTimestamps`.

We propose that Read Committed transactions do not do this. The cost of

To be clear, READ COMMITTED transactions would still provide "no stale reads" at the transaction level, it just wouldn't refresh at each statement, correct?


docs/RFCS/20230122_read_committed_isolation.md line 696 at r1 (raw file):

Instead of waiting on these conflicting intents until the intent's transaction
completes, the non-locking readers will perform a "status check" RPC in the form
of a `PushTxn(PUSH_TIMESTAMP)` to the transaction record of encountered intent

This behavior also exists today in write-read conflicts when the writer has a lower priority than the reader. I think the proposed behavior is fine, but it does raise the question of how exactly this interacts with transaction priorities.


docs/RFCS/20230122_read_committed_isolation.md line 813 at r1 (raw file):

prior effects of the statement will be rolled back on retry through the use of
[transaction
savepoints](https://www.cockroachlabs.com/docs/stable/savepoint.html). A

What is the performance impact of a savepoint per statement?


docs/RFCS/20230122_read_committed_isolation.md line 1009 at r1 (raw file):

for two reasons. First, applications that use `SELECT FOR SHARE` will behave
incorrectly if we continue to treat the locking strength as a no-op. Second,
CockroachDB's SQL layer will begin to use `SELECT FOR SHARE` internally to

Maybe a question for Arul's RFC, but what about SELECT FOR KEY SHARE, which is what postgres uses for this purpose? Would that be possible and useful for us to implement?


docs/RFCS/20230122_read_committed_isolation.md line 1398 at r1 (raw file):

[Foreign key checks](https://github.com/cockroachdb/cockroach/issues/80683) will
change to use `SELECT FOR SHARE` locks.

Would SELECT FOR SHARE be used all the time or only in read committed mode? Would these locks be immediately replicated or, following the same logic as the "implicit SFU" optimization above, would they continue to be unreplicated and checked at commit time?


docs/RFCS/20230122_read_committed_isolation.md line 1410 at r1 (raw file):

`UNIQUE WITHOUT INDEX` checks cannot easily be implemented correctly under Read
Committed using only row locking, because they depend on the non-existence of a
span of rows. Initially we will disallow `UNIQUE WITHOUT INDEX` checks under

What does it mean to disallow the check? It will be forbidden to insert into a table that uses such a constraint in READ COMMITTED mode?

And to spell out the implications of this, it won't be a problem for migrations per se because the migrating applications won't have UNIQUE WITHOUT INDEX constraints, but it will be an obstacle to allowing apps that rely on READ COMMITTED to adopt certain multi-region features (until the work in the next sentence is done)


docs/RFCS/20230122_read_committed_isolation.md line 1477 at r1 (raw file):

them and provides applications with Serializable isolation instead. This poses a
risk that applications which use CockroachDB today are unwittingly asking for
Read Committed, either directly or indirectly (e.g. through an ORM). These same

Do we know of any ORMs that might be doing this by default? We should audit our top ORMs and drivers to see if any of them are explicitly setting isolation by default instead of just going with the default picked by the server. (I'm not expecting this to be a problem because setting a default isolation level on the server side is fairly common in my experience, but you never know when a default will be set in some weird place)


docs/RFCS/20230122_read_committed_isolation.md line 1574 at r1 (raw file):

Proscribing `AS OF SYSTEM TIME` in Read Committed transactions would cause
confusion and inconvenience for users of CockroachDB, especially when attempting

Banning AOST would be especially inconvenient for users who make READ COMMITTED the default.

I think the behavior described here is right for AOST in the BEGIN or SET statement. But if the AOST clause appears in a SELECT statement in a READ COMMITTED transaction, I think the better behavior would be to set the timestamp only for that statement, but leave the transaction in per-statement snapshot mode so an AOST read could be followed by a regular RC statement. This feels like a better fit for RC expectations.

It's not actually that useful to do it this way at the statement level (it is meaningless that the AOST statement is "in the transaction"), but it would be useful if we took this a step further and allowed AOST in subqueries in RC mode.


docs/RFCS/20230122_read_committed_isolation.md line 1644 at r1 (raw file):

resolution of [#71946](https://github.com/cockroachdb/cockroach/issues/71946).

# Testing

A general note about testing: Our test suite (and tools like Elle) are mainly focused on ensuring that anomalies do not occur. For RC, we also need to make sure that we see the other properties we want (no retries, non-blocking reads). This will likely require a lot of new testing, not just adapting existing tests.


docs/RFCS/20230122_read_committed_isolation.md line 1709 at r1 (raw file):

Finally, existing workloads that run against Read Committed in other DBMS
systems will be solicited from customers. Where possible, these will be run
against CockroachDB's implementation of Read Committed to

Unfinished sentence


docs/RFCS/20230122_read_committed_isolation.md line 1736 at r1 (raw file):

Intra-Mutation Consistency might be implemented in CockroachDB.

# Drawbacks

I'd like to see a section on performance. There are a number of ways that this proposal affects performance, both good (less wasted work doing retries, no pre-commit span refreshes), and bad (lock replication, savepoint overhead, explicit locks for FK checks). Aside from the correctness concerns, how can we characterize the net expected performance of the two isolation levels?

There are also more subtle performance-related risks: Many applications have isolation-related bugs that go undetected because the app simply doesn't get enough traffic to hit the necessary race conditions. If your sub-millisecond operations suddenly start to take longer because foreign key checks now involve multiple RPCs, these latent bugs may be exposed.


docs/RFCS/20230122_read_committed_isolation.md line 1751 at r1 (raw file):

# Unresolved questions

## Should we implement Snapshot isolation (REPEATABLE READ) at the same time?

This is two separate questions: should we implement snapshot isolation, and if so should we give you snapshot isolation if you ask for repeatable read? Adya would argue no, that apps asking for RR are declaring their tolerance for phantom reads but not write skew and so would be unpleasantly surprised if we gave them snapshot. Practically speaking, however, there is ample precedent for conflating the two and no one really notices.

I agree that whatever the answer, we shouldn't do it yet and focus on RC only. (FWIW, I'd argue against exposing snapshot mode - so much of the benefit of RC comes from its ability to avoid retries by updating a per-statement snapshot. The cost/benefit of snapshot isolation isn't the same)

Contributor Author

@nvb nvb left a comment




docs/RFCS/20230122_read_committed_isolation.md line 46 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Another motivating factor (that was non-obvious to me when we discussed it earlier) is that read committed can also prevent retry errors for reasons other than serializability, and specifically for clock uncertainty errors (just retry the statement at a higher timestamp). Applications vary in how often they see serializability errors, and just like with deadlock, being careful about data access patterns can help to minimize them. But clock uncertainty errors can strike randomly at any time (even on read-only txns), and for many apps this was their biggest source of retry-related pain. READ COMMITTED solves this problem even though it's not strictly related to the isolation level (note that SNAPSHOT isolation was still subject to clock uncertainty errors, which was one of the reasons we removed it instead of fixing its bugs - it wasn't good enough to avoid the need for retry loops).

I adjusted this section to call this out explicitly.


docs/RFCS/20230122_read_committed_isolation.md line 315 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Thanks for tracking this down! I knew it must be something like that but was never sure what (nit: it's FOR KEY SHARE, not FOR SHARE, which further narrows the scope of the lock a bit).

Updated to mention SELECT FOR KEY SHARE, which is mentioned in the SHARED lock RFC.


docs/RFCS/20230122_read_committed_isolation.md line 493 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

To be clear, READ COMMITTED transactions would still provide "no stale reads" at the transaction level, it just wouldn't refresh at each statement, correct?

Correct. Updated the text to be more clear.


docs/RFCS/20230122_read_committed_isolation.md line 696 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This behavior also exists today in write-read conflicts when the writer has a lower priority than the reader. I think the proposed behavior is fine, but it does raise the question of how exactly this interacts with transaction priorities.

The way I've been imagining this interacting with txn priorities is that we redefine the rules so that PUSH_TIMESTAMP can succeed for equal priorities but will still block if the priority "class" of the pusher is less than that of the pushee. A consequence of that is that a high-priority reader will never block on any other transaction, which is why this will solve #71946.

See the change to CanPushWithPriority in nvb@9190180#diff-4094642b7aac6406063e650bb0c148bffff83eb9fb9eb18ef9f845481ff3702b (a WIP branch).


docs/RFCS/20230122_read_committed_isolation.md line 813 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

What is the performance impact of a savepoint per statement?

I believe it will be negligible. There is a small amount of client-side bookkeeping during the creation of a savepoint and the successful release of a savepoint, but nothing that is remote or impacts the transaction proto. When rolled back, there is moderately more work (and much more if we address #94731), but that's on the unhappy retry path.
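The per-statement savepoint pattern under discussion can be sketched with a minimal in-memory model: each statement runs under an implicit savepoint, so a statement-level retry rolls back only that statement's writes. This is a toy illustration, not CockroachDB's implementation:

```python
# Minimal in-memory sketch of per-statement savepoints: each statement
# runs under an implicit savepoint, so a statement-level retry undoes
# only that statement's writes. Toy model, not CockroachDB internals.

class MiniTxn:
    def __init__(self):
        self.writes = {}        # writes buffered so far in this txn
        self._savepoints = []   # stack of snapshots

    def savepoint(self):
        self._savepoints.append(dict(self.writes))

    def release(self):
        # Happy path: just drop the bookkeeping (cheap).
        self._savepoints.pop()

    def rollback_to_savepoint(self):
        # Unhappy path: restore the pre-statement state.
        self.writes = self._savepoints.pop()

def run_statement(txn, stmt_writes, fail=False):
    txn.savepoint()
    try:
        txn.writes.update(stmt_writes)
        if fail:
            raise RuntimeError("retryable statement error")
        txn.release()
    except RuntimeError:
        txn.rollback_to_savepoint()  # undo only this statement

txn = MiniTxn()
run_statement(txn, {"a": 1})
run_statement(txn, {"b": 2}, fail=True)  # rolled back
print(txn.writes)  # {'a': 1}
```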


docs/RFCS/20230122_read_committed_isolation.md line 1009 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Maybe a question for Arul's RFC, but what about SELECT FOR KEY SHARE, which is what postgres uses for this purpose? Would that be possible and useful for us to implement?

@arulajmani and I have discussed this. There's a mention of it in the SHARED lock RFC. That RFC does not propose that we immediately implement FOR KEY SHARE, but leaves the door open for it. The more difficult part of that project will be implicitly detecting the FOR NO KEY UPDATE locking strength during mutations.


docs/RFCS/20230122_read_committed_isolation.md line 1398 at r1 (raw file):

Would SELECT FOR SHARE be used all the time or only in read committed mode?

I think we'll start by only using them in RC mode just so we don't have to fight the "no regressions" fight immediately, but longer term, I'd like to minimize the differences between isolation levels. If we can do that after we optimize these locks (see Variants and Optimizations and Reliability for Preview Release), we'll have an easier time.


docs/RFCS/20230122_read_committed_isolation.md line 1410 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

What does it mean to disallow the check? It will be forbidden to insert into a table that uses such a constraint in READ COMMITTED mode?

And to spell out the implications of this, it won't be a problem for migrations per se because the migrating applications won't have UNIQUE WITHOUT INDEX constraints, but it will be an obstacle to allowing apps that rely on READ COMMITTED to adopt certain multi-region features (until the work in the next sentence is done)

Called all of this out in the text.


docs/RFCS/20230122_read_committed_isolation.md line 1477 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Do we know of any ORMs that might be doing this by default? We should audit our top ORMs and drivers to see if any of them are explicitly setting isolation by default instead of just going with the default picked by the server. (I'm not expecting this to be a problem because setting a default isolation level on the server side is fairly common in my experience, but you never know when a default will be set in some weird place)

Agreed, we'll want to audit our top ORMs and understand what effect this will have on them. cc. @rafiss


docs/RFCS/20230122_read_committed_isolation.md line 1574 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Banning AOST would be especially inconvenient for users who make READ COMMITTED the default.

I think the behavior described here is right for AOST in the BEGIN or SET statement. But if the AOST clause appears in a SELECT statement in a READ COMMITTED transaction, I think the better behavior would be to set the timestamp only for that statement, but leave the transaction in per-statement snapshot mode so an AOST read could be followed by a regular RC statement. This feels like a better fit for RC expectations.

It's not actually that useful to do it this way at the statement level (it is meaningless that the AOST statement is "in the transaction"), but it would be useful if we took this a step further and allowed AOST in subqueries in RC mode.

That's an interesting idea! I called it out in the text.


docs/RFCS/20230122_read_committed_isolation.md line 1644 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

A general note about testing: Our test suite (and tools like Elle) are mainly focused on ensuring that anomalies do not occur. For RC, we also need to make sure that we see the other properties we want (no retries, non-blocking reads). This will likely require a lot of new testing, not just adapting existing tests.

Good point. I'd be interested to get your thoughts on ways to exhaustively test these other properties beyond what's mentioned here.

For retry errors, I've mostly been thinking that we remove retry loops from applications, run them under Read Committed, and ensure that they don't throw errors. Integration testing for this property needs to live above SQL, because that's where the internal retry loop lives. The PG isolation test suite may also come in handy here.

For non-blocking reads, narrower testing will be possible. For instance, even testing right around the kv/kvserver/concurrency package will provide some confidence. Beyond that, PG's isolation test suite may also provide value.


docs/RFCS/20230122_read_committed_isolation.md line 1709 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Unfinished sentence

Done.


docs/RFCS/20230122_read_committed_isolation.md line 1736 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I'd like to see a section on performance. There are a number of ways that this proposal affects performance, both good (less wasted work doing retries, no pre-commit span refreshes), and bad (lock replication, savepoint overhead, explicit locks for FK checks). Aside from the correctness concerns, how can we characterize the net expected performance of the two isolation levels?

There are also more subtle performance-related risks: Many applications have isolation-related bugs that go undetected because the app simply doesn't get enough traffic to hit the necessary race conditions. If your sub-millisecond operations suddenly start to take longer because foreign key checks now involve multiple RPCs, these latent bugs may be exposed.

This is a good idea. I added a stub for the section, but have not filled it in yet.


docs/RFCS/20230122_read_committed_isolation.md line 1751 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This is two separate questions: should we implement snapshot isolation, and if so should we give you snapshot isolation if you ask for repeatable read? Adya would argue no, that apps asking for RR are declaring their tolerance for phantom reads but not write skew and so would be unpleasantly surprised if we gave them snapshot. Practically speaking, however, there is ample precedent for conflating the two and no one really notices.

I agree that whatever the answer, we shouldn't do it yet and focus on RC only. (FWIW, I'd argue against exposing snapshot mode - so much of the benefit of RC comes from its ability to avoid retries by updating a per-statement snapshot. The cost/benefit of snapshot isolation isn't the same)

Split into two questions and addressed your points.

@nvb nvb force-pushed the nvanbenschoten/readCommittedRFC branch from 2bffa57 to e714d27 on April 7, 2023 18:09
Contributor

@bdarnell bdarnell left a comment




docs/RFCS/20230122_read_committed_isolation.md line 696 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The way I've been imagining this interactive with txn priorities is that we redefine the rules so that PUSH_TIMESTAMP can succeed for equal priorities but will still block if the priority "class" of the pusher is less than that of the pushee. A consequence of that is that a high-priority reader will never block on any other transaction, which is why this will solve #71946.

See the change to CanPushWithPriority in nvb@9190180#diff-4094642b7aac6406063e650bb0c148bffff83eb9fb9eb18ef9f845481ff3702b (a WIP branch).

I was thinking more about the case when the reader is lower priority: it sounds like read/write conflicts for READ COMMITTED transactions will still block if the pending writer is higher priority.


docs/RFCS/20230122_read_committed_isolation.md line 1644 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Good point. I'd be interested to get your thoughts on ways to exhaustively test these other properties beyond what's mentioned here.

For retry errors, I've mostly been thinking that we remove retry loops from applications, run them under Read Committed, and ensure that they don't throw errors. Integration testing for this property needs to live above SQL, because that's where the internal retry loop lives. The PG isolation test suite may also come in handy here.

For non-blocking reads, narrower testing will be possible. For instance, even testing right around the kv/kvsever/concurrency package will provide some confidence. Beyond that, PG's isolation test suite may also provide value.

It's hard to test exhaustively; I would tend to lean on statistical methods here. Check that the distribution of statements that succeed on the first/second/nth attempt fall within some expected range. The most important thing to avoid is the problem that we see sometimes with the jepsen tests, in which deadlocks and the like don't get counted as "failures", so a run with zero successes and failures still counts as a "pass".

I don't think we need to actively remove retry loops from applications (might as well keep it possible to run the tests with either isolation level), but instead we should instrument them to assert that client-side retries are never used.
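The statistical check suggested here might look like the following sketch: record how many attempts each statement needed, then assert both that the retry fraction is within bounds and that the workload actually did enough work to be meaningful (guarding against the "zero successes still counts as a pass" failure mode). The function name and thresholds are illustrative:

```python
# Sketch of the statistical assertion described above: fail the run
# both when retries exceed an expected bound and when the workload did
# too little work (so an all-deadlock run can't silently "pass").
# The name check_retry_distribution and the thresholds are illustrative.

from collections import Counter

def check_retry_distribution(attempt_counts, min_statements=100,
                             max_retry_fraction=0.0):
    total = len(attempt_counts)
    assert total >= min_statements, \
        "workload did too little work to be meaningful"
    retried = sum(1 for n in attempt_counts if n > 1)
    frac = retried / total
    assert frac <= max_retry_fraction, \
        f"{frac:.1%} of statements needed client-visible retries"
    return Counter(attempt_counts)

# Under Read Committed we expect every statement to succeed on attempt 1.
histogram = check_retry_distribution([1] * 500)
print(histogram)  # Counter({1: 500})
```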

Collaborator

@rytaft rytaft left a comment


I've read through the Transaction Model section -- looks great! Just pushing a couple questions. Still reading the rest of the doc.



docs/RFCS/20230122_read_committed_isolation.md line 493 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Correct. Updated the text to be more clear.

I'm still confused -- how will this be enforced at the transaction level?

EDIT-- I think you explain this below -- that the uncertainty interval for the whole transaction doesn't change from statement to statement? I think you could make that a bit more explicit here.


docs/RFCS/20230122_read_committed_isolation.md line 681 at r2 (raw file):

we will implement the blocking behavior

Wait - but I thought you were getting rid of the blocking here?

Contributor

@knz knz left a comment


Thank you for the enjoyable read.

One important thing that didn't get coverage yet is current and future uses of InternalExecutor. Many current uses perform mutations and change state in system tables while expecting:

  • to use the same kv.Txn as the surrounding SQL statement, so as to commit atomically with it;
  • serializable isolation to enforce certain properties of the system tables.

In particular the correctness of the jobs system and certain schema changes would be impeded by having the kv.Txn configured with read committed.

I believe the RFC should propose a 3-pronged approach:

  • the IE API should be changed so that callers must clarify explicitly when they are able to use read committed.
  • in a first implementation, calls to IE in a read committed txn by callers that do not support it should be rejected with a "feature not supported" error with appropriate reports to telemetry.
  • incrementally, we should then review all subsystems that use IE and either confirm the existing code works with read committed txns (then add the explicit arg) or change their logic (then also add the explicit flag)

@knz
Contributor

knz commented Apr 19, 2023

Regarding internal executor, i also have two additional questions:

  • how does pg do this? (Subsystems using SQL internally)

  • how many features (eg DDL inside explicit txn) both have an implementation currently incompatible with read committed, and also are needed by those customers to whom we have promised that read committed would solve their problems?

The answer to the latter question might reveal critical project dependencies on other teams which were not considered so far.

@knz
Contributor

knz commented Apr 19, 2023

Another thing I'd like to read about (did I miss it?) is how read committed txns interact with table leases. Am I right to understand that table leases need to be refreshed on first use in each stmt, instead of on first use in each txn?

Then regarding conflicts on schema changes. IIRC pg does this thing where it locks rows in pg_catalog (with a shared lock) for the subset of a table's metadata that's presently needed. For example constraints are only locked for mutations. This ensures e.g. that certain DDL doesn't interfere with read only traffic, which some applications depend on. How much of this are we expecting to reach in this proposal? (This touches on my previous question about what customers may be expecting already from this work.)

Collaborator

@rytaft rytaft left a comment




docs/RFCS/20230122_read_committed_isolation.md line 1426 at r2 (raw file):

INDEX` checks in transactions run under Read Committed, so these transactions
will be unable to insert into tables with this form of constraint. Consequently,
`REGIONAL BY ROW` tables will be inaccessible to Read Committed transactions in

REGIONAL BY ROW tables don't use UNIQUE WITHOUT INDEX, so we should still be able to support them (they use implicitly partitioned indexes).

UNIQUE WITHOUT INDEX is an experimental feature that is mostly used for testing, so this limitation doesn't seem like a big concern (although it's good to call it out).

Contributor

@knz knz left a comment


Reviewed 1 of 1 files at r3, all commit messages.


docs/RFCS/20230122_read_committed_isolation.md line 69 at r3 (raw file):
One thing I'm confused about: what does "commonly" mean in this sentence?

Read Committed's isolation guarantees are weak enough that CockroachDB can commonly handle consistency-related retry errors internally without involvement from the application, eliminating the other reason for application-side retry.

Are there still some cases where the app needs to handle them?

I worry that unless we can obviate all needs for app-side handling, there will be some cases (e.g. in multi-region clusters) where the uncertainty errors will pop up, causing concern and churn with customers. Conversely, if the proposal in this RFC does not fully eliminate that source of errors, it should add a big fat disclaimer that the app changes will still be required for a full CockroachDB adoption (lest the app would still experience outages sometimes).

Contributor Author

@nvb nvb left a comment




docs/RFCS/20230122_read_committed_isolation.md line 493 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

I'm still confused -- how will this be enforced at the transaction level?

EDIT-- I think you explain this below -- that the uncertainty interval for the whole transaction doesn't change from statement to statement? I think you could make that a bit more explicit here.

Done.


docs/RFCS/20230122_read_committed_isolation.md line 696 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I was thinking more about the case when the reader is lower priority: it sounds like read/write conflicts for READ COMMITTED transactions will still block if the pending writer is higher priority.

This has been implemented in #103267.


docs/RFCS/20230122_read_committed_isolation.md line 681 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

we will implement the blocking behavior

Wait - but I thought you were getting rid of the blocking here?

Yeah, this was confusing. The absence of blocking is what I was referring to by "the blocking behavior". I rephrased this.


docs/RFCS/20230122_read_committed_isolation.md line 1426 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

REGIONAL BY ROW tables don't use UNIQUE WITHOUT INDEX, so we should still be able to support them (they use implicitly partitioned indexes).

UNIQUE WITHOUT INDEX is an experimental feature that is mostly used for testing, so this limitation doesn't seem like a big concern (although it's good to call it out).

Do implicitly partitioned indexes use the same underlying post-query mechanisms to verify uniqueness across all partitions?


docs/RFCS/20230122_read_committed_isolation.md line 69 at r3 (raw file):
Commonly means that the vast majority of consistency-related retry errors will not make it up to the client. The exception is when an uncertainty error is encountered after the statement has begun streaming a result set back to the client and a refresh fails. This is discussed below in the Consistency-Related Retries section.

There has been some discussion about using a larger default results_buffer_size under Read Committed (even up to spilling to disk) to ensure that in practice, these errors don't reach the client. To start, we plan to test with a diversity of applications to see whether the default sql.defaults.results_buffer.size (16KiB) is sufficient for most apps. Either way, you are correct that this will be documented alongside Read Committed.

Are there still some cases where the app needs to handle them?

For exactly the reasons you mentioned, the goal of this project is to make errors rare enough that apps do not need to include specific logic to handle them. If they still need error-handling logic in their app, we have not succeeded. That does not mean eliminating all sources of retry errors, but it does mean making them rare enough that they remain in the noise. There is precedent for this in other systems: even under Read Committed, lock-ordering deadlocks can occur and abort transactions, yet app developers are not typically expected to build handling for these errors into their apps.
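For reference, the buffer size discussed above can be raised from its 16KiB default; the size chosen below is purely illustrative, not a recommendation:

```sql
-- Cluster-wide default for new sessions (setting name from the
-- discussion above; the value is illustrative).
SET CLUSTER SETTING sql.defaults.results_buffer.size = '64KiB';
```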

Collaborator

@rytaft rytaft left a comment


Sorry for the unacceptably long delay! Just responded to a couple things.

Reviewed all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @bdarnell, @DrewKimball, @knz, @michae2, @nvanbenschoten, and @rafiss)


docs/RFCS/20230122_read_committed_isolation.md line 681 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Yeah, this was confusing. The absence of blocking is what I was referring to by "the blocking behavior". I rephrased this.

Doesn't look like anything changed here.


docs/RFCS/20230122_read_committed_isolation.md line 1426 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do implicitly partitioned indexes use the same underlying post-query mechanisms to verify uniqueness across all partitions?

Yes they do -- but we can use point queries to look up data in each partition rather than performing a full scan (as we would need to for UNIQUE WITHOUT INDEX). But I suppose the problem is still the same, in that we need to lock rows that don't exist.

This text is still incorrect since it implies that REGIONAL BY ROW relies on UNIQUE WITHOUT INDEX. It actually relies on PARTITION ALL BY under the hood.
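To illustrate (schema hypothetical): a REGIONAL BY ROW table implicitly partitions each index by the hidden region column, so a UNIQUE constraint on a column that is not prefixed by the region must be checked with point lookups into every partition:

```sql
-- Requires a multi-region database; table and columns are illustrative.
CREATE TABLE users (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  email STRING UNIQUE  -- uniqueness must hold across all region partitions
) LOCALITY REGIONAL BY ROW;
```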

nvb added a commit to nvb/hermitage that referenced this pull request Nov 10, 2023
CockroachDB's upcoming release will include preview support for the Read
Committed and Repeatable Read isolation levels. This commit updates the
Hermitage test suite for the new isolation levels, demonstrating the minimum
isolation level required to prevent each anomaly.

The only difference from Postgres is that CockroachDB's Read Committed level
prevents "Predicate-Many-Preceders (PMP) for write predicates". This is because
CockroachDB's Read Committed provides stronger isolation guarantees between the
read and writes within mutation statements. For details, see "Per-Statement Read
Snapshots" in the [Read Committed design document](cockroachdb/cockroach#100608).
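As a sketch of the Hermitage-style test referenced above (hypothetical `test` table; the interleaving of the two transactions is shown via comments):

```sql
-- Setup: two rows with value 10 and 20.
-- T1:
BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
UPDATE test SET value = value + 10;  -- rows become 20 and 30

-- T2, concurrently:
BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
DELETE FROM test WHERE value = 20;   -- blocks on T1's pending writes

-- T1:
COMMIT;
-- What T2's DELETE does next is what distinguishes the systems: per the
-- commit message above, CockroachDB's Read Committed evaluates the
-- mutation's reads and writes against a single per-statement snapshot,
-- preventing the PMP-for-write-predicates anomaly that PostgreSQL's
-- Read Committed permits here.
```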
@nvb nvb force-pushed the nvanbenschoten/readCommittedRFC branch from cad1df0 to 1c522b9 on November 10, 2023 at 22:06
@blathers-crl

blathers-crl bot commented Nov 10, 2023

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@nvb
Contributor Author

nvb commented Nov 10, 2023

I've updated this RFC with @DrewKimball's fantastic outline of the "Post-Locking Predicate Re-evaluation" alternative approach which we didn't end up going with.

There are still some TODOs around query planning for SELECT FOR UPDATE statements which would be good to address now that #111662 has landed, but we can address those in follow-up PRs.

For now, I'll go ahead and merge this with a status of "completed", because the vast majority of what was included in this design has now been implemented.

bors r+

@craig
Contributor

craig bot commented Nov 10, 2023

Build succeeded:

@craig craig bot merged commit 6eeb4c6 into cockroachdb:master Nov 10, 2023
@nvb nvb deleted the nvanbenschoten/readCommittedRFC branch November 11, 2023 18:10