perf: pipeline transactional writes at the KV layer #16026

@petermattis

Consider an application that executes the following statements one by one (yes, this could be done more efficiently, but an ORM is likely to produce SQL like this):

BEGIN;
SELECT balance FROM accounts WHERE id = 10;
UPDATE accounts SET balance = 101 WHERE id = 10;
SELECT balance FROM accounts WHERE id = 11;
UPDATE accounts SET balance = 99 WHERE id = 11;
COMMIT;

This transaction moves $1 from account 11 to account 10. Our current SQL implementation imposes unnecessary latency on the UPDATE operations. Internally, each UPDATE is a select for the matching rows followed by a consistent (replicated) write of the new balance. Only after the write completes do we return the number of rows updated. We suffer the latency of the consistent write on every statement, even though lower layers of the system will serialize access to a particular key. Translated into KV operations, this looks like:

Begin
Scan /accounts/10-/accounts/11
Put  /accounts/10 101
Scan /accounts/11-/accounts/12
Put  /accounts/11 99
Commit

Notice that we know the number of rows that will be updated as soon as the Scan operation completes. We can return to the client at that point and send the Put asynchronously. When the next statement arrives, we don't have to wait for any pending mutations, as the KV layer will serialize the operations [*] for a particular key. We would still have to wait for the outstanding mutations to complete before performing the commit. The above would become:

Begin
Scan /accounts/10-/accounts/11
  -> Go(Put /accounts/10 101)
Scan /accounts/11-/accounts/12
  -> Go(Put /accounts/11 99)
WaitForMutations
Commit
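
To make the shape of this concrete, here is a minimal Go sketch of the idea. It is not the actual SQL or KV code: `kvStore`, `txn`, `Update`, and `Commit` are hypothetical names invented for illustration, and the store is a plain in-memory map standing in for the replicated KV layer. The statement returns its row count right after the read, the Put runs in a goroutine, and the commit waits for all outstanding mutations. The sketch deliberately ignores the ordering concern raised in the footnote.

```go
package main

import (
	"fmt"
	"sync"
)

// kvStore is a hypothetical in-memory stand-in for the replicated KV layer.
type kvStore struct {
	mu   sync.Mutex
	data map[string]int
}

func newStore(initial map[string]int) *kvStore {
	s := &kvStore{data: map[string]int{}}
	for k, v := range initial {
		s.data[k] = v
	}
	return s
}

// Scan reads a single key (a real Scan would cover a key range).
func (s *kvStore) Scan(key string) (int, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.data[key]
	return v, ok
}

// Put writes a single key; in a real system this is where the
// latency of the replicated write would be paid.
func (s *kvStore) Put(key string, value int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[key] = value
}

// txn is a hypothetical transaction handle that tracks in-flight writes.
type txn struct {
	store   *kvStore
	pending sync.WaitGroup
}

// Update performs the Scan synchronously, then issues the Put in the
// background and returns the number of rows that will be updated.
func (t *txn) Update(key string, value int) int {
	if _, ok := t.store.Scan(key); !ok {
		return 0
	}
	t.pending.Add(1)
	go func() {
		defer t.pending.Done()
		t.store.Put(key, value)
	}()
	return 1
}

// Commit plays the role of WaitForMutations followed by Commit.
func (t *txn) Commit() {
	t.pending.Wait()
	// A real implementation would send the commit request here.
}

func main() {
	store := newStore(map[string]int{"/accounts/10": 100, "/accounts/11": 100})
	tx := &txn{store: store}
	fmt.Println("rows updated:", tx.Update("/accounts/10", 101))
	fmt.Println("rows updated:", tx.Update("/accounts/11", 99))
	tx.Commit()
	a, _ := store.Scan("/accounts/10")
	b, _ := store.Scan("/accounts/11")
	fmt.Println(a, b)
}
```

Because the two updates touch different keys and `Commit` blocks on the wait group, both balances are guaranteed to be written by the time the final reads run.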

There is a similar win to be had for DELETE operations, which involve a Scan followed by a DelRange. INSERT is somewhat more complicated, as it is translated into a ConditionalPut operation, which is internally a read (on the leader) followed by a write. We could potentially return from the ConditionalPut operation as soon as the write is proposed, but we'd have to leave the operation in the CommandQueue until the write is applied. There are likely lots of dragons here, though perhaps the approach of allowing a transactional write operation to return as soon as it is proposed could handle all of these cases. The TxnCoordSender would then need a facility for waiting for the transactional writes to be applied before allowing the transaction to be committed.

[*] While a replica will serialize operations for a particular key, multiple operations sent via separate DistSenders will not be serialized. We'd need to make sure that the operations sent for a transaction are somehow pipelined within the DistSender/TxnCoordSender. We wouldn't want a Scan operation to get reordered in front of a Put operation on the same key.
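
One way to picture the ordering requirement is a client-side pipeline that remembers the in-flight write for each key and makes later operations on that key wait for it. The Go sketch below is only an illustration under that assumption: `pipeline`, `AsyncPut`, and `Read` are hypothetical names, not DistSender or TxnCoordSender APIs, and the short sleep merely simulates replication latency.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// pipeline is a hypothetical per-transaction tracker of in-flight writes.
type pipeline struct {
	mu       sync.Mutex
	data     map[string]int
	inFlight map[string]chan struct{} // closed once the latest write to the key is applied
}

func newPipeline() *pipeline {
	return &pipeline{
		data:     map[string]int{},
		inFlight: map[string]chan struct{}{},
	}
}

// AsyncPut returns immediately. The write is applied in the background,
// after any earlier in-flight write to the same key has been applied.
func (p *pipeline) AsyncPut(key string, value int) {
	p.mu.Lock()
	prev := p.inFlight[key]
	done := make(chan struct{})
	p.inFlight[key] = done
	p.mu.Unlock()

	go func() {
		if prev != nil {
			<-prev // keep writes to the same key in issue order
		}
		time.Sleep(time.Millisecond) // simulated replication latency
		p.mu.Lock()
		p.data[key] = value
		p.mu.Unlock()
		close(done)
	}()
}

// Read waits for the latest in-flight write to the key, so it can never
// be reordered in front of a Put issued earlier in the same transaction.
func (p *pipeline) Read(key string) int {
	p.mu.Lock()
	done := p.inFlight[key]
	p.mu.Unlock()
	if done != nil {
		<-done
	}
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.data[key]
}

func main() {
	p := newPipeline()
	p.AsyncPut("/accounts/10", 101)
	// The read observes the pending write even though AsyncPut returned early.
	fmt.Println(p.Read("/accounts/10"))
}
```

A real implementation would also have to handle range scans that overlap pending writes and would clean up completed entries; the sketch only tracks single keys to keep the ordering rule visible.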

Cc @tschottdorf, @bdarnell

Labels

A-kv-transactions: Relating to MVCC and the transactional model.
C-performance: Perf of queries or internals. Solution not expected to change functional behavior.
