kv: buffered writes #72614
Labels
A-kv-transactions (Relating to MVCC and the transactional model), C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception), C-performance (Perf of queries or internals; solution not expected to change functional behavior), T-kv (KV Team)
Description
This issue explores the potential benefits (and costs) of introducing a new mode of operation for transactions, whereby writes would be buffered on the (KV) client until commit time. Currently, of course, writes are eagerly sent to their respective leaseholders and begin (async) replication.
Motivation
- Avoiding pipeline stalls on future reads of the written key within the same txn. As things stand, if a txn writes a key and then reads it, the read blocks on latches until the prior write finishes its async replication. This can be seen as a stall in the pipeline of transaction operations. But, given that the transaction knows what it's writing, and that the successful replication of the write is checked at commit time, blocking on latches seems unnecessary. We can imagine the value of the respective key coming from the transaction's write buffer, rather than from the leaseholder's storage.
- Common reads could avoid going to the leaseholder, being served instead exclusively on the client. For example, FK checks satisfied by a write in the same txn - e.g. an insert into a parent table followed by an insert into the child table in the same txn, as seen in TPCC (citation needed).
- This idea of avoiding blocking on latches seems applicable only to writes from the same txn. If there's an in-flight write from another txn, even if the reader somehow had access to the proposed value, it couldn't simply read it; the reader would have to block (on a lock) until the writer commits or aborts.
- Avoiding writers blocking readers during the writer txn's evaluation. As things stand, writers eagerly take locks, which then block readers for the writer's lifetime. If instead writes were buffered until commit time, and no locks were taken eagerly, then readers would only conflict with committing writers, not with evaluating writers. This would mean that locks would be held for unbounded amounts of time only in cases of failures, not in cases of long-running writing transactions.
- There are also downsides to not taking locks eagerly: the writer does not get the protection of the respective locks, and is more susceptible to being forced to change its timestamp at commit time (and thus perform a refresh). We can imagine that the policy of acquiring locks eagerly on writes would be beneficial sometimes (perhaps on transaction retries - epochs >= 1). Even when locks are acquired eagerly, the buffering of writes still gives most other advantages listed here.
- Amortizing write latency. At the moment, write pipelining amortizes the replication latency for writes. Still, each write performs a synchronous round-trip to the leaseholder. Buffering would avoid that.
- This only works for blind writes, which are not common in SQL. Read-write operations (e.g. CPut) would need to be split into a read phase (which could also optionally take locks; see above) and a (buffered) write.
- More 1PC. At the moment, many stars need to align to get a coveted 1PC execution - the SQL statement needs to be run as an implicit txn, the statement needs to be simple enough, and the SQL execution node needs to support committing the txn in the same batch as its mutations. It's impossible to get 1PC in an explicit txn (e.g. BEGIN; INSERT; COMMIT never gets it), and it's impossible to get 1PC when inserting into two tables (e.g. even when we had interleaved tables, we still couldn't get it). By buffering writes, we'd no longer need SQL's cooperation for getting 1PC execution; it can all be under the KV client's control.
- 1PC is good for throughput (less work to do per txn) and for tail latency (fewer round-trips between the client and the leaseholder).
- Interleaved tables are gone. But, maybe collocated tables will come back through kv,*: non-contiguous ranges #65726 allowing us to get 1PC across tables.
I think the first (avoiding pipeline stalls) and the fourth (more 1PC) are big.
Drawbacks
- There'd be complexity involved with supporting read-your-writes within a transaction if writes were buffered. We'd have to decide who's in a good position to interleave the results coming from the buffer with results coming from storage. It could be the client or the leaseholder. For eliding reads going to the leaseholder when they're satisfied by the buffer, the client needs to be involved.
- One implementation idea would be to ship the buffer to leaseholders serving reads and have the MVCC layer use iterators to interleave the buffer contents and scan results.
- DistSQL is an extra complication because it would seem we need SQL to collaborate in shipping the buffer around with the flows it schedules remotely.
- Buffering writes until commit time on the client will have a memory footprint. Presumably we'd bound it, like we do with other txn memory footprints.
- Splitting non-blind writes into a read and a buffered write may add up to an extra request to the leaseholder.
- The start of the replication of write intents would be deferred to commit time. But it would still be executed in parallel with the replication of the STAGING txn record.
- Writers are more susceptible to needing refreshes (see above).
- Write-write contention benefits from the existence of write locks. But maybe this point is moot now with SFU locking, which does not necessarily imply blocking non-locking (snapshot, in Spanner terms) reads (think Upgrade locks).
Jira issue: CRDB-11233