Skip to content

server: batch large span stats requests#109464

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
zachlite:230821.span_stats_batches
Sep 1, 2023
Merged

server: batch large span stats requests#109464
craig[bot] merged 1 commit intocockroachdb:masterfrom
zachlite:230821.span_stats_batches

Conversation

@zachlite
Copy link
Copy Markdown
Contributor

Previously, span stats requests could only request a maximum of 500 spans at a time, otherwise the request would return an error.

This commit adds transparent batching at both the system tenant's handler and the secondary tenant's handler. The cluster setting server.span_stats.span_batch_limit is used to control the size of the batches. The default value has been raised to 1000.

Resolves #105638
Epic: CRDB-30635
Release note (ops change): Requests for database details or table details from the UI, or usages of SHOW RANGES WITH DETAILS are no longer subject to errors if the number of requested spans is too large.

@zachlite zachlite requested a review from a team as a code owner August 24, 2023 21:13
@zachlite zachlite requested a review from a team August 24, 2023 21:13
@zachlite zachlite requested a review from a team as a code owner August 24, 2023 21:13
@zachlite zachlite requested review from a team and abarganier and removed request for a team August 24, 2023 21:13
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@zachlite zachlite added the backport-23.1.x PAST MAINTENANCE SUPPORT: 23.1 patch releases via ER request only label Aug 24, 2023
Copy link
Copy Markdown
Contributor

@maryliag maryliag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like you're missing a ./dev gen bazel
Also, you have tests failing, so not sure if there is something wrong on the code itself

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier)

@zachlite zachlite force-pushed the 230821.span_stats_batches branch from e47fd1d to 1f0a990 Compare August 25, 2023 14:38
@zachlite zachlite requested a review from a team as a code owner August 25, 2023 14:38
@zachlite zachlite force-pushed the 230821.span_stats_batches branch from 1f0a990 to a7b8941 Compare August 25, 2023 15:08
Copy link
Copy Markdown
Contributor

@j82w j82w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @zachlite)


pkg/server/span_stats_server.go line 387 at r2 (raw file):

// verifySpanStatsRequest returns an error if the request can not be serviced.
// Requests can not be serviced if the active cluster version is less than 23.1,
// or if the request is made using the pre 23.1 format.

Why can't the request serviced? Blocking mixed versions is painful for users and TSEs trying to troubleshoot.


pkg/server/span_stats_server.go line 433 at r2 (raw file):

	// Allocate memory once for the required number of batches.
	totalSpans := len(req.Spans)
	batches := make([][]roachpb.Span, 0, (totalSpans+batchSize-1)/batchSize)

What is the benefit of batching? Why not just send everything at once?

Copy link
Copy Markdown
Contributor Author

@zachlite zachlite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @j82w)


pkg/server/span_stats_server.go line 387 at r2 (raw file):

Previously, j82w (Jake) wrote…

Why can't the request serviced? Blocking mixed versions is painful for users and TSEs trying to troubleshoot.

The semantics of the nodeID field changed between 22.2 and 23.1, and the startKey and endKey params were replaced with the spans field. Basically, the behavior of the endpoint and the wire formats changed enough between 22.2 and 23.1 to warrant breaking compatibility.

But there are no proto behavior changes in this PR - I was just moving some code around and figured I'd add a comment.


pkg/server/span_stats_server.go line 433 at r2 (raw file):

Previously, j82w (Jake) wrote…

What is the benefit of batching? Why not just send everything at once?

Resource consumption. If a SQL pod makes this request, we want to be able to attribute the cost of the request to the pod, while we keep the burden away from KV. KV servicing lots of small requests will consume less memory than one big request. At least in theory :D

@abarganier abarganier removed their request for review August 28, 2023 15:00
Copy link
Copy Markdown
Contributor

@j82w j82w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @zachlite)


pkg/server/span_stats_server.go line 446 at r2 (raw file):

	res.SpanToStats = make(map[string]*roachpb.SpanStats, totalSpans)

	for _, batch := range batches {

Can this be optimized? The batches array allocation can be avoid by just merging the logic above with this loop. A slice can be used to limit the spans being sent.

@j82w
Copy link
Copy Markdown
Contributor

j82w commented Aug 31, 2023

pkg/server/span_stats_server.go line 454 at r2 (raw file):

		for k, v := range batchRes.SpanToStats {
			res.SpanToStats[k] = v

Should the batching logic first split the request by node id then by batch size? It would be bad for perf if the first 800 are for node 1, then next 800 are for node 2, etc because it would end up doing multiple calls to node 2.

@j82w
Copy link
Copy Markdown
Contributor

j82w commented Aug 31, 2023

pkg/server/status.go line 3633 at r2 (raw file):

	batchSize := int(roachpb.SpanStatsBatchLimit.Get(&s.st.SV))
	return batchedSpanStats(ctx, req, s.sqlServer.tenantConnect.SpanStats, batchSize)

What is the reason the batch logic was not done in the s.sqlServer.tenantConnect.SpanStats which already has to split the request by node id?

@zachlite zachlite force-pushed the 230821.span_stats_batches branch from a7b8941 to 1aba190 Compare August 31, 2023 16:39
Copy link
Copy Markdown
Contributor Author

@zachlite zachlite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dismissed @j82w from a discussion.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @j82w)


pkg/server/span_stats_server.go line 454 at r2 (raw file):

Previously, j82w (Jake) wrote…

Should the batching logic first split the request by node id then by batch size? It would be bad for perf if the first 800 are for node 1, then next 800 are for node 2, etc because it would end up doing multiple calls to node 2.

Discussed offline. Batches need to happen before node ID lookups.


pkg/server/status.go line 3633 at r2 (raw file):

Previously, j82w (Jake) wrote…

What is the reason the batch logic was not done in the s.sqlServer.tenantConnect.SpanStats which already has to split the request by node id?

Discussed offline. Requests directly to the system tenant should be batched too.

Copy link
Copy Markdown
Contributor

@j82w j82w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 4 files at r1, 1 of 4 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @zachlite)


pkg/server/span_stats_server.go line 466 at r3 (raw file):

		}

		for k, v := range batchRes.SpanToStats {

Is there any chance the same key can be returned multiple times? 1 from the lease holder and 1 from a non-lease holder? Just want to confirm there doesn't need to be some for of merge logic instead of just replacing the value.


pkg/server/span_stats_server_test.go line 88 at r3 (raw file):

		// Assert that the responses from each batch are returned to the caller.
		require.Equal(t, tcase.nSpans, len(res.SpanToStats))

It would be good to validate the actual content is valid.

Previously, span stats requests could only request a maximum
of 500 spans at a time, otherwise the request would return an error.

This commit adds transparent batching at both the system tenant's handler
and the secondary tenant's handler. The cluster setting
`server.span_stats.span_batch_limit` is used to control the
size of the batches. The default value has been raised to 1000.

Resolves cockroachdb#105638
Epic: CRDB-30635
Release note (ops change): Requests for database details or table details
from the UI, or usages of `SHOW RANGES WITH DETAILS` are no longer subject
to errors if the number of requested spans is too large.
@zachlite zachlite force-pushed the 230821.span_stats_batches branch from 1aba190 to 4e14279 Compare September 1, 2023 14:39
@zachlite
Copy link
Copy Markdown
Contributor Author

zachlite commented Sep 1, 2023

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Sep 1, 2023

Build succeeded:

@craig craig bot merged commit d9c137d into cockroachdb:master Sep 1, 2023
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Sep 1, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error setting reviewers, but backport branch blathers/backport-release-23.1-109464 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/109902/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. []

Backport to branch 23.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-23.1.x PAST MAINTENANCE SUPPORT: 23.1 patch releases via ER request only

Projects

None yet

Development

Successfully merging this pull request may close these issues.

server: allow an arbitrary number of spans in a span stats request

4 participants