
kvcoord: fix extremely rare OOM hazard in the DistSender#88506

Merged

craig[bot] merged 1 commit into cockroachdb:master from arulajmani:dist-sender-87167 on Sep 27, 2022

Conversation

@arulajmani
Collaborator

@arulajmani arulajmani commented Sep 22, 2022

The outer loop in the DistSender subdivides a BatchRequest into partial batches, each corresponding to a single range, and sends them out. Previously, the resolved span passed into the function responsible for sending out a partial batch (sendPartialBatch) corresponded to the span of the entire batch request rather than the partial batch. This resolved span is used to check whether the request needs to be subdivided further between retries in the outer loop of the DistSender. Given the wrong parameter value, we'd always determine that the batch needed to be subdivided.

This wasn't really an issue in practice, as we don't expect many retries here. However, if retries did pile up (e.g., timeouts causing the transport to be exhausted on every attempt), the unbounded recursion here could lead to an OOM.
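The shape of the bug can be sketched in a few lines. This is a simplified illustration, not the actual DistSender code: `span`, `rangeDesc`, and `needsSubdivision` are hypothetical stand-ins for `roachpb.RSpan`, the range descriptor, and the retry-loop check described above.

```go
package main

import "fmt"

// span is a simplified stand-in for a resolved key span (roachpb.RSpan).
type span struct{ start, end int }

// rangeDesc is a simplified range descriptor covering [start, end).
type rangeDesc struct{ start, end int }

// needsSubdivision mirrors the retry-loop check: a request must be
// re-divided if its resolved span extends beyond the range it was
// routed to.
func needsSubdivision(rs span, desc rangeDesc) bool {
	return rs.start < desc.start || rs.end > desc.end
}

func main() {
	fullBatch := span{0, 100} // span of the entire BatchRequest
	partial := span{0, 50}    // span truncated to the partial batch
	desc := rangeDesc{0, 50}  // the range the partial batch targets

	// Bug: the full-batch span is passed, so the check fires on every
	// retry and the partial batch is spuriously re-subdivided.
	fmt.Println(needsSubdivision(fullBatch, desc)) // true

	// Fix: pass the span truncated to the partial batch.
	fmt.Println(needsSubdivision(partial, desc)) // false
}
```

With the full-batch span, every retry recurses back into the divide-and-send path, which is the unbounded recursion described above.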

References #87167

Release note: None

@arulajmani arulajmani requested a review from nvb September 22, 2022 20:30
@arulajmani arulajmani requested a review from a team as a code owner September 22, 2022 20:30
@cockroach-teamcity
Member

This change is Reviewable

Contributor

@nvb nvb left a comment


Nice catch! It looks like this is fallout from 7fb06a2#diff-271a3179bf7f12a2ed29b993611bf24f4c3c4a582a785d5d29a61d2d02c5f21fL1553, where we removed the needsTruncate argument and logic from sendPartialBatch but continued passing the untruncated rs span. So we won't need to backport this to v22.1 or earlier, but we'll want to get this back to v22.2 as soon as possible.

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani)


-- commits line 17 at r1:
It's probably worth pointing at 7fb06a2 in the commit message, in case someone wants to understand this in the future.


pkg/kv/kvclient/kvcoord/dist_sender_test.go line 4604 at r1 (raw file):

	splitRDB := mockRangeDescriptorDBForDescs(splitDescs...)

	var numAttempts atomic.Int32

atomic.Int32 is new as of Go 1.19. I thought we had kept the go directive in the go.mod file at 1.17 so that new features (i.e. generics) could not be used yet. And yet, CI is green. @rickystewart do you know what's going on here?

In the meantime, let's switch this back to a var numAttempts int32 and use the old atomic.AddInt32 and atomic.LoadInt32 so we can proceed with this PR.


pkg/kv/kvclient/kvcoord/dist_sender_test.go line 4637 at r1 (raw file):

	require.True(t, testutils.IsError(pErr.GoError(), "boom"))
	// 6 attempts each for the two partial batches.
	require.Equal(t, int32(12), numAttempts.Load())

How exactly does this fail without your fix?

Contributor

@nvb nvb left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani)


pkg/kv/kvclient/kvcoord/dist_sender.go line 1535 at r1 (raw file):

}

// sendPartialBatch sends the supplied batch to the range specified by desc.

s/by desc/by the routing token/


pkg/kv/kvclient/kvcoord/dist_sender.go line 1539 at r1 (raw file):

// The batch request is supposed to be truncated already so that it contains
// only requests which intersect the range descriptor and keys for each request
// are limited to the range's key span. positions argument describes how the

Consider updating this comment to mention the relationship between rs and the current range descriptor (in routingTok).

@tbg
Member

tbg commented Sep 23, 2022

Great find! Just for my understanding, in #87167 did we actually hit the "extremely rare OOM hazard"? We do seem to get a very large stack which points to yes, but then also there wasn't an extremely deep DistSender stack there.

@erikgrinaker
Contributor

Yeah, great find @arulajmani, thanks for tracking this down!

in #87167 did we actually hit the "extremely rare OOM hazard"? We do seem to get a very large stack which points to yes, but then also there wasn't an extremely deep DistSender stack there.

I'm pretty sure that DistSender goroutine was to blame for the stack overflow, but I agree that the stack was suspiciously shallow for a 1 GB stack. I can only assume that the stack frames were very large or some such. Maybe this is better in Go 1.19 given the new stack handling? Might be worth having a look at some typical stack sizes in steady-state clusters and see what's up, the runtime has some metrics for it.

The outer loop in the DistSender subdivides a BatchRequest into partial
batches corresponding to a single range and sends them out. Previously,
the resolved span passed into the function responsible for sending out
this partial batch (`sendPartialBatch`) corresponded to the span of the
entire batch request (as opposed to the partial batch). This
resolved span is used to check if the request needs to be subdivided
further between retries in the outer loop of the DistSender. Given the
wrong parameter value, we'd always end up determining that the batch
needed to be subdivided.

This wasn't really an issue in practice as we don't expect too many
retries here. However, if we did (eg. timeouts causing the transport
to be exhausted on every try), the infinite recursion here could lead
to an OOM.

The issue here was introduced in 7fb06a22d6b5ac19f764306bdb43133946da9664
when we stopped truncating the supplied `rs` to the supplied range
descriptor.

References cockroachdb#87167

Release note: None
Collaborator Author

@arulajmani arulajmani left a comment


TFTR!

Might be worth having a look at some typical stack sizes in steady-state clusters

I looked at the DistSender stack, and doing some stack pointer/frame pointer math, these stack frames seem to be around ~2KB. The stack doesn't seem deep enough to hit that 1GB limit, so maybe there's more to look at here. I'll reword the commit message so that merging this PR doesn't close the issue.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @rickystewart)


-- commits line 17 at r1:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It's probably worth pointing at 7fb06a2 in the commit message, in case someone wants to understand this in the future.

Done.


pkg/kv/kvclient/kvcoord/dist_sender.go line 1539 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Consider updating this comment to mention the relationship between rs and the current range descriptor (in routingTok).

Added more words here.


pkg/kv/kvclient/kvcoord/dist_sender_test.go line 4604 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

atomic.Int32 is new as of Go 1.19. I thought we had kept the go directive in the go.mod file at 1.17 so that new features (i.e. generics) could not be used yet. And yet, CI is green. @rickystewart do you know what's going on here?

In the meantime, let's switch this back to a var numAttempts int32 and use the old atomic.AddInt32 and atomic.LoadInt32 so we can proceed with this PR.

I didn't realize this was a new feature. Great advert for GoLand autocomplete though haha


pkg/kv/kvclient/kvcoord/dist_sender_test.go line 4637 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

How exactly does this fail without your fix?

It would just time out, because the maxRetries we set above wouldn't be respected. It's not the greatest test, given that we don't configure this in production; maybe we should?

@arulajmani
Collaborator Author

bors r=nvanbenschoten

@craig
Contributor

craig bot commented Sep 26, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented Sep 27, 2022

Build succeeded:

@nvb
Contributor

nvb commented Sep 27, 2022

I'm pretty sure that DistSender goroutine was to blame for the stack overflow, but I agree that the stack was suspiciously shallow for a 1 GB stack.

I also don't see how this adds up to 1GB. Subtracting the first stack pointer from the last in this goroutine does not give 1GB. And yet, we see a matching stack pointer in the panic text:

runtime: goroutine stack exceeds 1000000000-byte limit
runtime: sp=0xc1fb20c480 stack=[0xc1fb20c000, 0xc21b20c000]
fatal error: stack overflow

Notice that 0xc1fb20c480 is just beyond the address of the last stack frame that we see in the goroutine dump:

goroutine 5053528 [running]:
github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).reset(0xc066f4e700?, 0x1a3a00d3dd593d05?, 0x593a67a0ff4f5455?, {0x513f1f6?, 0x13?}, 0x4d1c58?, {0xc0bbb0299f62a59e?, 0xe2360df4425?, 0x9789580?}, 0xc01b2d06e0, ...)
        github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:577 +0xcd7 fp=0xc1fb20c490 sp=0xc1fb20c488 pc=0xc24937
github.com/cockroachdb/cockroach/pkg/util/tracing.(*Tracer).newSpan(0x0?, 0x0?, 0x0?, {0x513f1f6, 0x13}, 0x0?, {0x0?, 0x0?, 0x9789580?}, 0xc01b2d06e0, ...)
        github.com/cockroachdb/cockroach/pkg/util/tracing/tracer.go:971 +0x16f fp=0xc1fb20c570 sp=0xc1fb20c490 pc=0xc2b5af
...

So this goroutine does seem to be at fault.
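The numbers in the panic line are consistent with this, assuming the runtime's stack-doubling growth policy: the reported stack region is 512 MiB, so the next growth attempt would double it past the 1000000000-byte limit. A quick arithmetic check:

```go
package main

import "fmt"

func main() {
	// Bounds from the panic line: stack=[0xc1fb20c000, 0xc21b20c000].
	lo := uint64(0xc1fb20c000)
	hi := uint64(0xc21b20c000)

	fmt.Printf("stack region: %d MiB\n", (hi-lo)>>20) // 512 MiB

	// The Go runtime grows stacks by doubling, so the next growth
	// (512 MiB -> 1 GiB) exceeds the 1000000000-byte limit and
	// triggers the fatal stack-overflow error.
	fmt.Println((hi-lo)*2 > 1000000000) // true
}
```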
