perf: pipeline transactional writes at the KV layer #16026
Description
Consider an application which executes the following statements one-by-one (yes, this could be done more efficiently, but an ORM is likely to produce SQL like this):
```sql
BEGIN;
SELECT balance FROM accounts WHERE id = 10;
UPDATE accounts SET balance = 101 WHERE id = 10;
SELECT balance FROM accounts WHERE id = 11;
UPDATE accounts SET balance = 99 WHERE id = 11;
COMMIT;
```
This transaction is moving $1 from account 11 to account 10. Our current SQL implementation imposes unnecessary latency on the UPDATE operations. Internally, each UPDATE is a select for the matching rows followed by a consistent (replicated) write of the new balance. After the write completes we return the number of rows updated. We suffer the latency of the consistent write even though lower layers of the system will serialize access to a particular key. Translating into KV operations this looks like:
```
Begin
Scan /accounts/10-/accounts/11
Put /accounts/10 101
Scan /accounts/11-/accounts/12
Put /accounts/11 99
Commit
```
Notice that we know the number of rows that will be updated after the Scan operation completes. We can return to the client at that point and asynchronously send the Put. When the next statement arrives, we don't have to wait for any pending mutations as the KV layer will serialize the operations [*] for a particular key. We would have to wait for the outstanding mutations to complete before performing the commit. The above would become:
```
Begin
Scan /accounts/10-/accounts/11
-> Go(Put /accounts/10 101)
Scan /accounts/11-/accounts/12
-> Go(Put /accounts/11 99)
WaitForMutations
Commit
```
There is a similar win to be had for DELETE operations, which involve a Scan followed by a DelRange. INSERT is somewhat more complicated, as it is translated into a ConditionalPut operation, which is internally a read (on the leader) followed by a write. We could potentially return from the ConditionalPut operation as soon as the write is proposed, but we'd have to leave the operation in the CommandQueue until the write is applied. There are likely lots of dragons here, though perhaps the approach of allowing a transactional write operation to return as soon as it is proposed could handle all of these cases. The TxnCoordSender would then need a facility for waiting for outstanding transactional writes to be applied before allowing the transaction to be committed.
[*] While a replica will serialize operations for a particular key, multiple operations sent via separate DistSenders will not. We'd need to make sure that the operations sent for a transaction are somehow pipelined within the DistSender/TxnCoordSender: we wouldn't want a Scan operation to be reordered ahead of an earlier Put operation.