-
Notifications
You must be signed in to change notification settings - Fork 4.1k
sql: implement immutable tables that automatically use follower reads #26301
Description
Background
It's very common in real-world database schemas to have reference tables - infrequently-updated tables, often relatively small, that are used as foreign key tables in applications. Some examples:
- a table of zipcodes. zipcodes are very infrequently deleted or added
- the
itemstable fromtpcc. items in tpcc are never updated - a table of historical closing prices in a financial database. this table is updated once a day, but not during the day
These kinds of tables present some issues for CockroachDB. In CockroachDB, doing a read from a table requires going to the table's leaseholder, to make sure that the reads are consistent. In an application that plays to CockroachDB's strengths that has many, globally distributed datacenters, reference tables like this will cause poor performance in several situations:
- inserts into tables that have foreign keys to a reference table will be at least as slow as a read to the leaseholder of the reference table, which might live across the globe
- joins from tables to a reference table will require going to the leaseholder, which might live across the globe
This may be unacceptable for customers who desire low-latency reads against reference tables and are willing to sacrifice write performance against those tables.
We can support this in CockroachDB by providing users the ability to mark a table as immutable, and using that information to provide fast, transactional reads to any replica.
Proposed solution for CockroachDB: Immutable tables
Immutable tables are tables that cannot be modified. Tables can be set as immutable, and reset back to mutable, by performing a schema change with an ordinary ALTER TABLE command.
When a table is marked as immutable, it's possible to drastically improve read performance from it by altering the read semantics. There are (at least) two ways of doing this:
- Using inconsistent reads against the table.
That's safe because we know that the table is immutable, implying that all of its replicas are up-to-date. - Using follower reads (read from a follower with timestamp bound #16593) against the table, with a read timestamp somewhere in the past, divorced from the active transaction timestamp. That's safe because follower reads guarantee that follower replicas are up-to-date as of some interval of time in the past.
The first solution is harder to implement, because of the semantics of leaving the immutable state - we'd have to invent a way to wait until data has fully replicated to all replicas. It also might be unsafe altogether - see @nvanbenschoten's comment on this issue.
The second solution is easier, and should be possible as soon as follower reads are available, using the schema change infrastructure and an intermediate state similar to the intermediate WRITE_ONLY state used for index backfills. The rest of this issue proposes to use the second solution.
Details on immutable/mutable states and state change process
Table descriptors will be extended with a new field, mutability, that has three possible values:
mutable- Reads from a
mutabletable behave as normal. - Writes to a
mutabletable are permitted.
- Reads from a
immutable-must-read-transactional- Reads from an
immutable-must-read-transactionaltable behave as normal. - Writes to an
immutable-must-read-transactionaltable are not permitted.
- Reads from an
immutable- Reads from an
immutabletable are performed at a historical timestamp,tseconds in the past (wheretis defined by the follower read safe past interval), and get served by a follower read - Writes to an
immutabletable are not permitted
- Reads from an
By default, mutability is set to mutable. To transition a table from mutable to immutable, a user issues an ALTER TABLE t SET IMMUTABLE command (or similar - syntax to be decided later), which performs the following actions:
- Set table descriptor to
immutable-must-read-transactionaland wait until the cluster converges on the new descriptor. - Wait
tseconds to ensure that, once transitioned toimmutable, all reads fromtseconds ago occur before the last possible write to the reference table. - Set table descriptor to
immutable.
To transition a table from immutable to mutable, a user issues an ALTER TABLE t SET MUTABLE command, which performs the following actions:
- Set table descriptor to
immutable-must-read-transactionaland wait until the cluster converges on the new descriptor. - Set table descriptor to
mutable. (note we don't have to waittseconds here, since we're moving in a "safe" direction, causing all future reads to be ordinary transactional reads.)