client,kv: rationalize the TxnCoordSender lifetime by andreimatei · Pull Request #35224 · cockroachdb/cockroach

andreimatei · 2019-02-27T02:37:49Z

This patch changes how rolling back of transactions on errors works. It used to be that the TxnCoordSender would sometimes rollback a transaction, sometimes it wouldn't: it would do it on TransactionAbortedErrors, but not on any other error that a batch encounters. It would also do it when the heartbeat loop detects the transaction as aborted asynchronously. When it did roll back, the rollback code lived in the heartbeat loop interceptor, which made it very awkward for the TxnCoordSender to perform it in case of TransactionAbortedErrors - the TxnCoordSender was calling directly into the interceptor which then uses a callback to call back into the TCS... It was all a mess.

This patch moves all the rollback concerns to the TCS, and also makes the behavior more clear: the TCS now rolls back (async) on every error, except non-TransactionAbortedError retriable errors (so, it doesn't rollback on epoch bumps, naturally). This all seems like it's the way it should have been for a while now: since the only thing a transaction can do after encountering an error is rollback, why (sometimes) wait for the client to send that rollback?
So, now, the situation is simple: as soon as the client.Txn receives an error, the TxnCoordSender should not be used any more (or, rather, for the future we should zero it out and put it back in some pool). A call to client.Txn.Rollback() after an error becomes a no-op.

This patch also does something else major: it purges pErrs from the client interface. TxnCoordSender terminates pErrs and returns regular errors upwards. Thus, the TCS no longer implements the Sender interface.
This has been a dream of mine since years ago. The fact that pErrs were returned by TCS meant that all sorts of errors had to be created as protos and put in the ErrorDetail union, when in fact all they wanted to be is plain-old go errors. Also, pErrs come with baggage like a Txn being attached to them, which makes no sense to clients, and are generally awkward. The TCS can't quite return simple errors, unfortunately, since the index of the request within the batch that generated an error is still needed by SQL - mainly because some CPut errors need to be transformed into "duplicate key" errors and such. So, TCS.Send() and other batch-oriented interfaces in the Txn and the DB now return *ErrWithIndex.

wip Release note: None

Release note: None

cockroach-teamcity · 2019-02-27T02:37:57Z

This change is

andreimatei · 2019-02-27T02:38:43Z

Nathan, this patch isn't ready, but I'd like your feedback, particularly on the interface changes in Txn and DB. You should look at the commits as squashed.

nvb

There are a lot of comments in the PR so it's hard to tell what the proposal is. My TL;DR is that I'm a big 👍 on moving the rollbacks to the TxnCoordSender and a 👍 on simplifying the interface to send rollbacks from client.Txn.

In general, we need to make TxnCoordSender easier to use without an in-depth understanding of its internals. Making rollbacks opaque moves in that direction. I'm not sure that all of the other changes in here do.

Reviewed 5 of 13 files at r1, 4 of 22 files at r2, 11 of 14 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten)

pkg/internal/client/batch.go, line 79 at r3 (raw file):

}

// MustErr returns the error resulting from a failed execution of the batch,

Do we actually need this anymore?

pkg/internal/client/sender.go, line 101 at r3 (raw file):

}

type ErrWithIndex struct {

To be honest, I don't really think this is a net win. We're trading one obscure error wrapper for another (more obscure) one and now there are just more things to keep straight. There are also now two almost identical Sender interfaces and their difference doesn't seem particularly well justified.

What do we get out of this past what we would get from zeroing out pErr.Txn before returning it from the txn coord sender?

I think the most compelling reason to make a change like this is the ErrorDetail point you made. Creating new ErrorDetail variants specific to the client.Txn-TxnCoordSender boundary is horrible. But is it unavoidable if we return a roachpb.Error from TxnCoordSender? I'm not convinced. We've been pretty undisciplined about creating and maintinaing a clean abstraction boundary here in the past. This PR is a big step in the right direction! I wouldn't be suprised if continuing to pull on this thread allows other simplification like removing TxnAlreadyEncounteredErrorError to fall out.

What do you think?

pkg/internal/client/sender.go, line 161 at r3 (raw file):

	// RollbackAsync initiates a rollback.
	RollbackAsync(context.Context)

Nice. This doesn't necessarily need to be a new method though. We could just as easily catch Send(EndTransaction{Commit: false}) and short-circuit the execution.

pkg/internal/client/sender.go, line 537 at r3 (raw file):

}

// TxnRestartError indicates that a retriable error happened and so the

On a similar note, I'm not sure about this. Keeping this as an ErrorDetail doesn't bother me at all because it's not important to the client.Txn where it even comes from. It very well could be coming from a kv server.

Either way, let's ensure that the name "TransactionRetryWithProtoRefreshError" doesn't survive this change.

pkg/internal/client/txn.go, line 914 at r3 (raw file):

// origTxnID is the id of the txn that generated retryErr. Note that this can be
// different from retryErr.Transaction - the latter might be a new transaction.
func (txn *Txn) replaceSenderIfTxnAbortedLocked(ctx context.Context, retryErr TxnRestartError) {

While we're rethinking some of this, let's keep #22615 in mind. I don't think we need to move all the way to a "txn handle" model to make a big improvement there.

pkg/kv/txn_coord_sender.go, line 64 at r3 (raw file):

//   // txnErrorClientNotified
//
//   txnErrorClientNotNotified

Won't these kinds of states between components lead to extra coupling, which we're trying hard to avoid? We're adding a lot of logic and a lot of moving parts into the TxnCoordSender so that... we can have good assertions when the client.Txn uses the TxnCoordSender unexpectedly?

We should be pushing to make it harder to misuse TxnCoordSender instead.

pkg/kv/txn_interceptor_heartbeater.go, line 449 at r3 (raw file):

}

// abortTxnAsyncLocked send an EndTransaction(commmit=false) asynchronously.

Yeah, moving this into the TxnCoordSender is fantastic.

andreimatei

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)

pkg/kv/txn_coord_sender.go, line 1018 at r3 (raw file):

				ba.Header = roachpb.Header{Txn: &txn}
				ba.Add(&roachpb.EndTransactionRequest{
					Commit: false,

Raphael insists on sometimes keeping the poisoning here

andreimatei

Note to self: one thing to consider is that, if we keep the rollbacks "async" in the TxnCoordSender, then we either have to keep poisoning the abort spans when intents are cleared, or have another mechanism to (in)validate results coming from remote DistSQL flow.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)

bobvawter · 2020-03-10T16:31:50Z

All committers have signed the CLA.

andreimatei added 3 commits February 26, 2019 21:12

kv, client: rationalize error handling and rollback

9338b6a

wip Release note: None

pErr->err binge

84a7268

ErrWithIndex

edbe26b

Release note: None

andreimatei requested review from a team and nvb February 27, 2019 02:37

nvb reviewed Feb 28, 2019

View reviewed changes

andreimatei mentioned this pull request Mar 20, 2019

roachtest: scaledata/distributed_semaphore/nodes=6 failed #34695

Closed

nvb mentioned this pull request Apr 10, 2019

[WIP] kv: improve transaction heartbeats' interaction with concurrent commits #36729

Open

andreimatei commented May 9, 2019

View reviewed changes

tbg added the X-noremind Bots won't notify about PRs with X-noremind label Jun 19, 2019

andreimatei commented Sep 24, 2019

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

client,kv: rationalize the TxnCoordSender lifetime#35224

client,kv: rationalize the TxnCoordSender lifetime#35224
andreimatei wants to merge 3 commits intocockroachdb:masterfrom
andreimatei:txn.cleanup

andreimatei commented Feb 27, 2019

Uh oh!

cockroach-teamcity commented Feb 27, 2019

Uh oh!

andreimatei commented Feb 27, 2019

Uh oh!

nvb left a comment

Uh oh!

andreimatei left a comment

Uh oh!

andreimatei left a comment

Uh oh!

bobvawter commented Mar 10, 2020 •

edited by cockroach-teamcity

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

andreimatei commented Feb 27, 2019

Uh oh!

cockroach-teamcity commented Feb 27, 2019

Uh oh!

andreimatei commented Feb 27, 2019

Uh oh!

nvb left a comment

Choose a reason for hiding this comment

Uh oh!

andreimatei left a comment

Choose a reason for hiding this comment

Uh oh!

andreimatei left a comment

Choose a reason for hiding this comment

Uh oh!

bobvawter commented Mar 10, 2020 • edited by cockroach-teamcity Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

bobvawter commented Mar 10, 2020 •

edited by cockroach-teamcity

Loading