client,kv: rationalize the TxnCoordSender lifetime#35224
client,kv: rationalize the TxnCoordSender lifetime#35224andreimatei wants to merge 3 commits intocockroachdb:masterfrom
Conversation
wip Release note: None
Release note: None
|
Nathan, this patch isn't ready, but I'd like your feedback, particularly on the interface changes in |
nvb
left a comment
There was a problem hiding this comment.
There are a lot of comments in the PR so it's hard to tell what the proposal is. My TL;DR is that I'm a big 👍 on moving the rollbacks to the TxnCoordSender and a 👍 on simplifying the interface to send rollbacks from client.Txn.
In general, we need to make TxnCoordSender easier to use without an in-depth understanding of its internals. Making rollbacks opaque moves in that direction. I'm not sure that all of the other changes in here do.
Reviewed 5 of 13 files at r1, 4 of 22 files at r2, 11 of 14 files at r3.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten)
pkg/internal/client/batch.go, line 79 at r3 (raw file):
} // MustErr returns the error resulting from a failed execution of the batch,
Do we actually need this anymore?
pkg/internal/client/sender.go, line 101 at r3 (raw file):
} type ErrWithIndex struct {
To be honest, I don't really think this is a net win. We're trading one obscure error wrapper for another (more obscure) one and now there are just more things to keep straight. There are also now two almost identical Sender interfaces and their difference doesn't seem particularly well justified.
What do we get out of this past what we would get from zeroing out pErr.Txn before returning it from the txn coord sender?
I think the most compelling reason to make a change like this is the ErrorDetail point you made. Creating new ErrorDetail variants specific to the client.Txn-TxnCoordSender boundary is horrible. But is it unavoidable if we return a roachpb.Error from TxnCoordSender? I'm not convinced. We've been pretty undisciplined about creating and maintinaing a clean abstraction boundary here in the past. This PR is a big step in the right direction! I wouldn't be suprised if continuing to pull on this thread allows other simplification like removing TxnAlreadyEncounteredErrorError to fall out.
What do you think?
pkg/internal/client/sender.go, line 161 at r3 (raw file):
// RollbackAsync initiates a rollback. RollbackAsync(context.Context)
Nice. This doesn't necessarily need to be a new method though. We could just as easily catch Send(EndTransaction{Commit: false}) and short-circuit the execution.
pkg/internal/client/sender.go, line 537 at r3 (raw file):
} // TxnRestartError indicates that a retriable error happened and so the
On a similar note, I'm not sure about this. Keeping this as an ErrorDetail doesn't bother me at all because it's not important to the client.Txn where it even comes from. It very well could be coming from a kv server.
Either way, let's ensure that the name "TransactionRetryWithProtoRefreshError" doesn't survive this change.
pkg/internal/client/txn.go, line 914 at r3 (raw file):
// origTxnID is the id of the txn that generated retryErr. Note that this can be // different from retryErr.Transaction - the latter might be a new transaction. func (txn *Txn) replaceSenderIfTxnAbortedLocked(ctx context.Context, retryErr TxnRestartError) {
While we're rethinking some of this, let's keep #22615 in mind. I don't think we need to move all the way to a "txn handle" model to make a big improvement there.
pkg/kv/txn_coord_sender.go, line 64 at r3 (raw file):
// // txnErrorClientNotified // // txnErrorClientNotNotified
Won't these kinds of states between components lead to extra coupling, which we're trying hard to avoid? We're adding a lot of logic and a lot of moving parts into the TxnCoordSender so that... we can have good assertions when the client.Txn uses the TxnCoordSender unexpectedly?
We should be pushing to make it harder to misuse TxnCoordSender instead.
pkg/kv/txn_interceptor_heartbeater.go, line 449 at r3 (raw file):
} // abortTxnAsyncLocked send an EndTransaction(commmit=false) asynchronously.
Yeah, moving this into the TxnCoordSender is fantastic.
andreimatei
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)
pkg/kv/txn_coord_sender.go, line 1018 at r3 (raw file):
ba.Header = roachpb.Header{Txn: &txn} ba.Add(&roachpb.EndTransactionRequest{ Commit: false,
Raphael insists on sometimes keeping the poisoning here
andreimatei
left a comment
There was a problem hiding this comment.
Note to self: one thing to consider is that, if we keep the rollbacks "async" in the TxnCoordSender, then we either have to keep poisoning the abort spans when intents are cleared, or have another mechanism to (in)validate results coming from remote DistSQL flow.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)
This patch changes how rolling back of transactions on errors works. It used to be that the TxnCoordSender would sometimes rollback a transaction, sometimes it wouldn't: it would do it on
TransactionAbortedErrors, but not on any other error that a batch encounters. It would also do it when the heartbeat loop detects the transaction as aborted asynchronously. When it did roll back, the rollback code lived in the heartbeat loop interceptor, which made it very awkward for the TxnCoordSender to perform it in case ofTransactionAbortedErrors- theTxnCoordSenderwas calling directly into the interceptor which then uses a callback to call back into the TCS... It was all a mess.This patch moves all the rollback concerns to the TCS, and also makes the behavior more clear: the TCS now rolls back (async) on every error, except non-
TransactionAbortedErrorretriable errors (so, it doesn't rollback on epoch bumps, naturally). This all seems like it's the way it should have been for a while now: since the only thing a transaction can do after encountering an error is rollback, why (sometimes) wait for the client to send that rollback?So, now, the situation is simple: as soon as the
client.Txnreceives an error, theTxnCoordSendershould not be used any more (or, rather, for the future we should zero it out and put it back in some pool). A call toclient.Txn.Rollback()after an error becomes a no-op.This patch also does something else major: it purges
pErrsfrom the client interface.TxnCoordSenderterminatespErrsand returns regularerrorsupwards. Thus, theTCSno longer implements theSenderinterface.This has been a dream of mine since years ago. The fact that
pErrswere returned by TCS meant that all sorts of errors had to be created as protos and put in theErrorDetailunion, when in fact all they wanted to be is plain-old go errors. Also,pErrscome with baggage like aTxnbeing attached to them, which makes no sense to clients, and are generally awkward. TheTCScan't quite return simple errors, unfortunately, since the index of the request within the batch that generated an error is still needed by SQL - mainly because someCPuterrors need to be transformed into "duplicate key" errors and such. So, TCS.Send() and other batch-oriented interfaces in theTxnand theDBnow return*ErrWithIndex.