Skip to content

client,kv: rationalize the TxnCoordSender lifetime#35224

Open
andreimatei wants to merge 3 commits intocockroachdb:masterfrom
andreimatei:txn.cleanup
Open

client,kv: rationalize the TxnCoordSender lifetime#35224
andreimatei wants to merge 3 commits intocockroachdb:masterfrom
andreimatei:txn.cleanup

Conversation

@andreimatei
Copy link
Copy Markdown
Contributor

This patch changes how rolling back of transactions on errors works. It used to be that the TxnCoordSender would sometimes rollback a transaction, sometimes it wouldn't: it would do it on TransactionAbortedErrors, but not on any other error that a batch encounters. It would also do it when the heartbeat loop detects the transaction as aborted asynchronously. When it did roll back, the rollback code lived in the heartbeat loop interceptor, which made it very awkward for the TxnCoordSender to perform it in case of TransactionAbortedErrors - the TxnCoordSender was calling directly into the interceptor which then uses a callback to call back into the TCS... It was all a mess.

This patch moves all the rollback concerns to the TCS, and also makes the behavior more clear: the TCS now rolls back (async) on every error, except non-TransactionAbortedError retriable errors (so, it doesn't rollback on epoch bumps, naturally). This all seems like it's the way it should have been for a while now: since the only thing a transaction can do after encountering an error is rollback, why (sometimes) wait for the client to send that rollback?
So, now, the situation is simple: as soon as the client.Txn receives an error, the TxnCoordSender should not be used any more (or, rather, for the future we should zero it out and put it back in some pool). A call to client.Txn.Rollback() after an error becomes a no-op.

This patch also does something else major: it purges pErrs from the client interface. TxnCoordSender terminates pErrs and returns regular errors upwards. Thus, the TCS no longer implements the Sender interface.
This has been a dream of mine since years ago. The fact that pErrs were returned by TCS meant that all sorts of errors had to be created as protos and put in the ErrorDetail union, when in fact all they wanted to be is plain-old go errors. Also, pErrs come with baggage like a Txn being attached to them, which makes no sense to clients, and are generally awkward. The TCS can't quite return simple errors, unfortunately, since the index of the request within the batch that generated an error is still needed by SQL - mainly because some CPut errors need to be transformed into "duplicate key" errors and such. So, TCS.Send() and other batch-oriented interfaces in the Txn and the DB now return *ErrWithIndex.

@andreimatei andreimatei requested review from a team and nvb February 27, 2019 02:37
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@andreimatei
Copy link
Copy Markdown
Contributor Author

Nathan, this patch isn't ready, but I'd like your feedback, particularly on the interface changes in Txn and DB. You should look at the commits as squashed.

Copy link
Copy Markdown
Contributor

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a lot of comments in the PR so it's hard to tell what the proposal is. My TL;DR is that I'm a big 👍 on moving the rollbacks to the TxnCoordSender and a 👍 on simplifying the interface to send rollbacks from client.Txn.

In general, we need to make TxnCoordSender easier to use without an in-depth understanding of its internals. Making rollbacks opaque moves in that direction. I'm not sure that all of the other changes in here do.

Reviewed 5 of 13 files at r1, 4 of 22 files at r2, 11 of 14 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten)


pkg/internal/client/batch.go, line 79 at r3 (raw file):

}

// MustErr returns the error resulting from a failed execution of the batch,

Do we actually need this anymore?


pkg/internal/client/sender.go, line 101 at r3 (raw file):

}

type ErrWithIndex struct {

To be honest, I don't really think this is a net win. We're trading one obscure error wrapper for another (more obscure) one and now there are just more things to keep straight. There are also now two almost identical Sender interfaces and their difference doesn't seem particularly well justified.

What do we get out of this past what we would get from zeroing out pErr.Txn before returning it from the txn coord sender?

I think the most compelling reason to make a change like this is the ErrorDetail point you made. Creating new ErrorDetail variants specific to the client.Txn-TxnCoordSender boundary is horrible. But is it unavoidable if we return a roachpb.Error from TxnCoordSender? I'm not convinced. We've been pretty undisciplined about creating and maintinaing a clean abstraction boundary here in the past. This PR is a big step in the right direction! I wouldn't be suprised if continuing to pull on this thread allows other simplification like removing TxnAlreadyEncounteredErrorError to fall out.

What do you think?


pkg/internal/client/sender.go, line 161 at r3 (raw file):

	// RollbackAsync initiates a rollback.
	RollbackAsync(context.Context)

Nice. This doesn't necessarily need to be a new method though. We could just as easily catch Send(EndTransaction{Commit: false}) and short-circuit the execution.


pkg/internal/client/sender.go, line 537 at r3 (raw file):

}

// TxnRestartError indicates that a retriable error happened and so the

On a similar note, I'm not sure about this. Keeping this as an ErrorDetail doesn't bother me at all because it's not important to the client.Txn where it even comes from. It very well could be coming from a kv server.

Either way, let's ensure that the name "TransactionRetryWithProtoRefreshError" doesn't survive this change.


pkg/internal/client/txn.go, line 914 at r3 (raw file):

// origTxnID is the id of the txn that generated retryErr. Note that this can be
// different from retryErr.Transaction - the latter might be a new transaction.
func (txn *Txn) replaceSenderIfTxnAbortedLocked(ctx context.Context, retryErr TxnRestartError) {

While we're rethinking some of this, let's keep #22615 in mind. I don't think we need to move all the way to a "txn handle" model to make a big improvement there.


pkg/kv/txn_coord_sender.go, line 64 at r3 (raw file):

//   // txnErrorClientNotified
//
//   txnErrorClientNotNotified

Won't these kinds of states between components lead to extra coupling, which we're trying hard to avoid? We're adding a lot of logic and a lot of moving parts into the TxnCoordSender so that... we can have good assertions when the client.Txn uses the TxnCoordSender unexpectedly?

We should be pushing to make it harder to misuse TxnCoordSender instead.


pkg/kv/txn_interceptor_heartbeater.go, line 449 at r3 (raw file):

}

// abortTxnAsyncLocked send an EndTransaction(commmit=false) asynchronously.

Yeah, moving this into the TxnCoordSender is fantastic.

Copy link
Copy Markdown
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)


pkg/kv/txn_coord_sender.go, line 1018 at r3 (raw file):

				ba.Header = roachpb.Header{Txn: &txn}
				ba.Add(&roachpb.EndTransactionRequest{
					Commit: false,

Raphael insists on sometimes keeping the poisoning here

@tbg tbg added the X-noremind Bots won't notify about PRs with X-noremind label Jun 19, 2019
Copy link
Copy Markdown
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: one thing to consider is that, if we keep the rollbacks "async" in the TxnCoordSender, then we either have to keep poisoning the abort spans when intents are cleared, or have another mechanism to (in)validate results coming from remote DistSQL flow.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)

@bobvawter
Copy link
Copy Markdown
Contributor

bobvawter commented Mar 10, 2020

CLA assistant check
All committers have signed the CLA.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

X-noremind Bots won't notify about PRs with X-noremind

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants