-
Notifications
You must be signed in to change notification settings - Fork 4.1k
kv: concurrent use of client.Txn can cause unintended data to be committed in the face of retryable errors #17197
Description
I believe the following scenario is possible:
- two
RETURNING NOTHINGstatements, A and B use the sameclient.Txn, and continuously send write batches on it. - one of the batches sent by A encounters
TransactionRestartError. The proto attached to that error has an updated epoch. client.Txn.send()updates the local proto to that incremented epoch.- A gets the error and stops sending requests.
- however, nobody tells B anything - it continues executing and sending batches. Except now the batches it sends are using the incremented epoch.
- after B eventually finishes, the sql
Executorwill drain the parallel stmt execution queue and find out about A's retryable error. At that point, it'll tell the client to restart the sql txn. - the client restarts, expecting that everything that has been previously written to be discarded.
- but surprise, the last batches sent by B before the restart will be committed.
This looks to me to be a consequence of adding support for concurrency to the client.Txn internals without adapting its API to multiple threads. The assumption of client.Txn.send() is that clients respond immediately to errors and stop sending requests - except that's not what happens with our concurrent use of the txn by multiple sql statements.
I think I'll fix it by adding a multithreaded flavor to the client.Txn API whereby clients get "locked" to an epoch. Once locked, requests are not sent if the txn has been advanced to a different epoch. When a response is received (success or error), it should also not update the txn proto if the epoch has been incremented since the time it was sent (I'm not sure if this is strictly required given how Transaction.Update() only ratchets up epochs and timestamps, but it seems like a sane thing to do to me). The Executor would do this "locking of the epoch" of the KV txn every time it restarts a SQL transaction.
I'm not sure who should implement this API, though. Should it be implemented by a wrapper (or "handle") on top of a client.Txn - say, client.TxnHandle - and then should clients be able to use either a client.Txnor a client.TxnHandle (by extracting a common interface)?
cc @cockroachdb/core @nvanbenschoten @tschottdorf @bdarnell