kv: clarify the Sender contract and pre-allocate commit batches#30485
craig[bot] merged 3 commits into cockroachdb:master
Conversation
Tobi, putting this up on the flag pole to see if the cat licks it up.
tbg
left a comment
I'm with you in spirit, and I'm glad the time has come where such reuse is actually important for performance, but I'm afraid that turning it off and going wild-west on the whole thing is going to cost us dearly in obscure bugs. Let's figure out what the contracts should be (to make the optimizations most powerful) and then let's think about the best way to enforce them (I hope that in the process we can get rid of this transport, which is quite expensive and slows down race a bunch).
The race transport doesn't only exist because we might be manipulating data on the server (in the local case). It also exists because grpc might be reading the data even after the call came back (or so we assume). In the spirit of there are no benign data races, it follows that unless you can prove that what you gave to gRPC is no longer going to be accessed by it, you don't want to mutate it. I haven't looked at what grpc does with the passed-in message exactly. Ideally it would encode it before it does anything async and then we're home free and can forget about the race transport for that purpose.
Then what's left is finding better ways to enforce the guarantees we want in our code.
For example,
Server side code is currently not allowed to mutate objects referenced by a Batch.
Can (mostly) be caught by hashing the batch on the way in and on the way out and comparing at the (Node).Batch endpoint (unless the request gets passed elsewhere out-of-band, but that doesn't happen often and we can audit it). But I agree that we're probably fairly good here, though that too may change as we move objects between the heap and the stack (and so what was a shallow copy ceases to be one).
Ideally I think we get to a place where the incoming request (incl. the Txn proto) is treated immutable at the node (i.e. below DistSender) and where we feel comfortable that it's not passed around across goroutine boundaries anywhere in DistSender or above (and so can be pooled and reused aggressively).
Reviewable status:
complete! 0 of 0 LGTMs obtained
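The hash-and-compare check described above could look roughly like this. This is a sketch under stated assumptions: `BatchRequest`, `fingerprint`, and `checkNoMutation` are hypothetical stand-ins, not the actual (Node).Batch code, and the real check would hash the proto's marshaled bytes.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// BatchRequest stands in for roachpb.BatchRequest; in real code the
// fingerprint would be taken over the marshaled proto bytes.
type BatchRequest struct {
	Data []byte
}

// fingerprint hashes the request's serialized form.
func fingerprint(ba *BatchRequest) [32]byte {
	return sha256.Sum256(ba.Data)
}

// checkNoMutation wraps a handler and asserts that the server did not
// mutate the incoming request, per the proposed contract.
func checkNoMutation(ba *BatchRequest, handler func(*BatchRequest)) error {
	before := fingerprint(ba)
	handler(ba)
	if fingerprint(ba) != before {
		return fmt.Errorf("server mutated the incoming BatchRequest")
	}
	return nil
}

func main() {
	ba := &BatchRequest{Data: []byte("scan [a, b)")}
	// A well-behaved handler reads but does not mutate.
	err := checkNoMutation(ba, func(ba *BatchRequest) { _ = len(ba.Data) })
	fmt.Println("read-only handler:", err)
	// A buggy handler mutates the request in place and is caught.
	err = checkNoMutation(ba, func(ba *BatchRequest) { ba.Data[0] = 'X' })
	fmt.Println("mutating handler:", err)
}
```

As noted in the comment, this only catches in-band mutation; requests handed off out-of-band would need a separate audit.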
Force-pushed 1eb4e63 to c876f8e
andreimatei
left a comment
I looked into grpc... At first it looked pretty convincing that there's no funny business and that, once the thing returns from an Invoke() call, there's no referencing of the request in the background or anything. But then I realized that I was looking at an older version. In the current version it's a bit harder to assert anything; it's not the most traceable code. But I didn't see any clues that there might be monkey business, and thinking about it I'm confident enough that it's all fine - like, in order for it to not be fine, the code would really have to try explicitly to screw you.
I've added better comments as per our convo to the Sender iface. PTAL.
bdarnell
left a comment
I think it's reasonable to assume that grpc won't touch the input proto once the client method has returned. But I'm not sure about making that same assumption about DistSender. Couldn't tasks started by sendPartialBatchAsync still be running when DistSender.Send returns?
Ben, I think your concern is valid. I hadn't seen this on prior inspection but, upon more looking, I think the following early return is a problem: cockroach/pkg/kv/dist_sender.go, line 811 in 8886bf1. Looks like we return without waiting for all the sub-batches when we get an error trying to "combine" two responses. How legitimate these errors are, I don't know. I'll see if I can get rid of them, or otherwise make the DistSender wait for all sub-batches regardless. So I guess in the meantime I'll take the relevant commit out of #30448.
Force-pushed 81e441b to e2c0dda
I got rid of those combine errors.
Force-pushed 6c6c382 to d5b3b2e
nvb
left a comment
Reviewed 1 of 5 files at r1, 1 of 3 files at r2.
pkg/internal/client/txn.go, line 41 at r3 (raw file):
typ TxnType
// alloc holds pre-allocated fields that can be used to avoid heap allocations
Nice! The only thing I'll point out is that this kind of optimization is even more impactful when we start pooling the Txn object itself. Let's leave a TODO to do that, along with the TxnCoordSender. They're both huge objects that would benefit a lot from being pooled. Pooling them would also force us to clean up their lifetimes, which is a good idea on its own.
pkg/internal/client/txn.go, line 44 at r3 (raw file):
// for batches with lifetimes tied to a Txn.
alloc struct {
	requests [4]roachpb.RequestUnion
We're only ever using a single element out of this, so let's give this a length of 1.
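The pre-allocation pattern under review can be sketched as follows, using hypothetical stand-in types (the real fields live on client.Txn and use the roachpb protos). Per the review comment, the array has length 1.

```go
package main

import "fmt"

// EndTxnRequest and RequestUnion stand in for the roachpb types.
type EndTxnRequest struct{ Commit bool }
type RequestUnion struct{ EndTxn *EndTxnRequest }

// Txn pre-allocates the pieces needed to build a commit batch, so that
// committing doesn't hit the heap. This is safe because a single txn
// never runs two commits concurrently.
type Txn struct {
	alloc struct {
		requests [1]RequestUnion // only one element is ever used
		endTxn   EndTxnRequest
	}
}

// commitBatch builds the commit request out of the pre-allocated fields
// instead of allocating fresh ones on every commit.
func (t *Txn) commitBatch() []RequestUnion {
	t.alloc.endTxn = EndTxnRequest{Commit: true}
	t.alloc.requests[0] = RequestUnion{EndTxn: &t.alloc.endTxn}
	return t.alloc.requests[:]
}

func main() {
	txn := &Txn{}
	reqs := txn.commitBatch()
	fmt.Println(len(reqs), reqs[0].EndTxn.Commit) // 1 true
}
```

The design point is that the slice and the EndTransaction request both alias memory owned by the Txn, which is exactly why the Sender contract about who may reference the request after Send() returns matters.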
pkg/roachpb/api.go, line 246 at r1 (raw file):
}
if rh.ResumeSpan != nil {
	log.Fatalf(ctx, "combining %+v with %+v", rh.ResumeSpan, otherRH.ResumeSpan)
I don't know about this. We shouldn't fatal if we get something we don't like back from a different node in the cluster. Why don't we just change the defer in divideAndSendBatchToRange to treat the error like any other error? It can return it later without immediately breaking from the loop.
bdarnell
left a comment
Reviewed 4 of 5 files at r1, 3 of 3 files at r2, 1 of 1 files at r3.
pkg/roachpb/api.go, line 246 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I don't know about this. We shouldn't fatal if we get something we don't like back from a different node in the cluster. Why don't we just change the defer in `divideAndSendBatchToRange` to treat the error like any other error? It can return it later without immediately breaking from the loop.
I agree. Changing a panic to a fatal here doesn't hurt anything, but it's moving in the wrong direction to force us to fatal instead of returning an error. It would be better to keep the error return from combine() and just delay the return until everything is done.
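The error-returning shape being advocated here can be sketched with stand-in types (the real combine lives in pkg/roachpb/api.go and operates on response protos; `ScanResponse`/`PutResponse` below are toy versions):

```go
package main

import "fmt"

type Response interface{ isResponse() }
type ScanResponse struct{ Rows []string }
type PutResponse struct{}

func (ScanResponse) isResponse() {}
func (PutResponse) isResponse()  {}

// combine merges a partial response into an accumulated one. Rather than
// fatal-ing on a mismatched pair, it returns an error the caller can
// propagate once all sub-batches have been waited for.
func combine(acc, other Response) (Response, error) {
	a, ok1 := acc.(ScanResponse)
	b, ok2 := other.(ScanResponse)
	if !ok1 || !ok2 {
		return nil, fmt.Errorf("can't combine %T with %T", acc, other)
	}
	return ScanResponse{Rows: append(a.Rows, b.Rows...)}, nil
}

func main() {
	r, err := combine(ScanResponse{Rows: []string{"a"}}, ScanResponse{Rows: []string{"b"}})
	fmt.Println(r, err)
	_, err = combine(ScanResponse{}, PutResponse{})
	fmt.Println(err)
}
```

The caller's loop records such an error but keeps draining results, returning it only after every sub-batch has completed.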
andreimatei
left a comment
pkg/roachpb/api.go, line 246 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
I agree. Changing a panic to a fatal here doesn't hurt anything, but it's moving in the wrong direction to force us to fatal instead of returning an error. It would be better to keep the error return from `combine()` and just delay the return until everything is done.
Before I change everything back, I'd like to litigate this more.
Ben, the position you've expressed when we were talking before seems inconsistent to me: you said you're fine with assertions about the behavior of a single process, but not so much about what is received from a different node. And you said that this policy preference is not necessarily related to version mismatches between nodes. But generally it seems to me like the only sane thing to do is have the same policy across nodes that we have within a single process. Like, we make a million assumptions about what another node will return; we can't accept arbitrary data, and trying to tolerate programming errors on another node that we don't tolerate locally... I don't see why we'd do that. If a remote node says this is the data at this key, we take it at face value. We don't assert that it returned data from the key we requested, and such.
What gives?
I'm just coming from the general principle that any crash that can be triggered by data received over the network is a bug (not necessarily a high-priority bug, but a bug nonetheless). If the existing code didn't have an error return I wouldn't be advocating for adding one at this point, but since it's there, removing it seems like a step in the wrong direction. I think it's worth a little extra effort to keep the error return from combine() and modify the loop to wait for all results regardless of error.
andreimatei
left a comment
pkg/roachpb/api.go, line 246 at r1 (raw file):
Pasting Ben's response here:
I'm just coming from the general principle that any crash that can be triggered by data received over the network is a bug (not necessarily a high-priority bug, but a bug nonetheless). If the existing code didn't have an error return I wouldn't be advocating for adding one at this point, but since it's there removing it seems like a step in the wrong direction.
I think it's worth a little extra effort to keep the error return from combine() and modify the loop to wait for all results regardless of error. But it's not worth a lot of extra effort. I was also mainly reacting to Nathan's comment without looking too closely at the particular panic here. If combining ResumeSpans and Errors are the only way combine can fail, then I'm not too worried about the panics.
Thanks. I'd still be inclined to debate what exactly you're thinking when you say "crash that can be triggered by data received over the network is a bug" - like, if we try to read the value of a sequence and we get back a string, would it be a "bug" if we crashed? Or an inconsistent lease? I don't think you want to argue that, so you probably want to refine your statement to only refer to some response "metadata". But even then, do you think that a more precise description of what you want to express can be phrased, and still be general enough that I can tattoo it on my chest?
We don't need to have a general discussion if you don't want to; we can discuss this specific case.
- Combining something with a `ResumeSpan` was a panic; now it's a fatal. I think this indicates a situation where the gateway split a batch that it shouldn't have.
- Combining a response that had an error was an error. Now it's a fatal. This indicates a bug in the DistSender's logic that's supposed to strip these errors before combining anything.
- Combining two "non-combinable" responses was an error. The patch makes it a fatal. "Non-combinable" means we were trying to combine, say, a `ScanResponse` with a `PutResponse`. Even before the patch, trying to combine a `ScanResponse` with a `DelRangeResponse` (where they're both "combinable") would be a crash. This indicates a bug in how the gateway calculates the positions of the responses it combines or, I guess, a server-side bug where we rearrange the responses to individual constituents of a batch.
So, given these specific changes, do you object to the two new fatals introduced?
I do want to argue that. We should strive to return an error instead of panicking for things like this. Data-dependent crashes are nearly always bugs (sometimes, like at startup, it really doesn't make sense to continue if we can't read what's on the disk, but ideally even that would be handled by returning an error up the stack).
Not enough to keep arguing about it.
+1 to @bdarnell's stance here. My general rule of thumb is that errors that can be scoped to a single request should be.
... before returning from DistSender.Send(). It had an obscure error case in which it wasn't waiting for all of them. This waiting starts being mandated by the next commits. Release note: None
This patch clarifies the transport's contract wrt ownership and lifetime of the request. In particular, it makes it clear that the client is allowed to mutate the request after the Send() call completes. The race transport is then relaxed a bit to allow some specific mutations. Release note: None
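The clarified contract might read something like the sketch below. This is a paraphrase with toy types, not the actual client.Sender interface from the PR; `echoSender` is a hypothetical implementation used to illustrate the ownership rule.

```go
package main

import "fmt"

type BatchRequest struct{ Key string }
type BatchResponse struct{ Value string }

// Sender sends a batch. Per the clarified contract: ownership of the
// request stays with the caller, and once Send returns, the caller is
// free to mutate or reuse the BatchRequest; implementations must not
// retain references to it past the call.
type Sender interface {
	Send(*BatchRequest) (*BatchResponse, error)
}

// echoSender is a toy implementation that copies what it needs out of the
// request instead of holding on to it, as the contract demands.
type echoSender struct{}

func (echoSender) Send(ba *BatchRequest) (*BatchResponse, error) {
	return &BatchResponse{Value: ba.Key}, nil // copies the string; no retained reference
}

func main() {
	var s Sender = echoSender{}
	ba := &BatchRequest{Key: "a"}
	br, _ := s.Send(ba)
	ba.Key = "mutated" // legal under the contract once Send has returned
	fmt.Println(br.Value)
}
```

Because the implementation copied out what it needed, the post-Send mutation of `ba` cannot affect the response, which is exactly the property the relaxed race transport is allowed to assume.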
Force-pushed d5b3b2e to 755d2be
When committing, the client.Txn used to allocate a slice of requests (inside a BatchRequest), a RequestUnion, and an EndTransactionRequest. This shows up on profiles of a single-node read-heavy workload. This patch moves to pre-allocating these inside a Txn. This works since there are no concurrent commits on a single txn. Release note: None
Force-pushed 755d2be to 06c1adf
andreimatei
left a comment
I've reintroduced the combine errors (and converted the panic we had to an error too).
PTAL.
pkg/internal/client/txn.go, line 41 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Nice! The only thing I'll point out is that this kind of optimization is even more impactful when we start pooling the `Txn` object itself. Let's leave a TODO to do that, along with the TxnCoordSender. They're both huge objects that would benefit a lot from being pooled. Pooling them would also force us to clean up their lifetimes, which is a good idea on its own.
Done.
pkg/internal/client/txn.go, line 44 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
We're only ever using a single element out of this, so let's give this a length of 1.
Done.
pkg/roachpb/api.go, line 246 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
This conversation has moved to the top-level PR comments.
bdarnell
left a comment
Reviewed 2 of 8 files at r4, 3 of 3 files at r5, 1 of 1 files at r6.
nvb
left a comment
Reviewed 8 of 8 files at r4, 2 of 3 files at r5, 1 of 1 files at r6.
Reviewable status: complete! 1 of 0 LGTMs obtained
Timed out
bors r+
…On Thu, Sep 27, 2018 at 5:34 PM craig[bot] wrote:
Timed out
Build succeeded
This reverts commit 06c1adf. Revert the last commit from cockroachdb#30485 - the one that introduced pre-allocation and possible reuse of EndTransaction batches. It's causing races. I'll figure it out and re-send the original patch. Touches cockroachdb#30769 Release note: None
30773: Revert "client: don't allocate some commit batches" r=andreimatei a=andreimatei This reverts commit 06c1adf. Revert the last commit from #30485 - the one that introduced pre-allocation and possible reuse of EndTransaction batches. It's causing races. I'll figure it out and re-send the original patch. Touches #30769 Release note: None Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
See individual commits.
Release note: None