server: batch large span stats requests #109464
Conversation
maryliag
left a comment
looks like you're missing a `./dev gen bazel`
Also, you have tests failing, so I'm not sure if there is something wrong in the code itself
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier)
Force-pushed from e47fd1d to 1f0a990
Force-pushed from 1f0a990 to a7b8941
j82w
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @zachlite)
pkg/server/span_stats_server.go line 387 at r2 (raw file):
// verifySpanStatsRequest returns an error if the request can not be serviced.
// Requests can not be serviced if the active cluster version is less than 23.1,
// or if the request is made using the pre-23.1 format.
Why can't the request be serviced? Blocking mixed versions is painful for users and TSEs trying to troubleshoot.
pkg/server/span_stats_server.go line 433 at r2 (raw file):
// Allocate memory once for the required number of batches.
totalSpans := len(req.Spans)
batches := make([][]roachpb.Span, 0, (totalSpans+batchSize-1)/batchSize)
What is the benefit of batching? Why not just send everything at once?
zachlite
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @j82w)
pkg/server/span_stats_server.go line 387 at r2 (raw file):
Previously, j82w (Jake) wrote…
Why can't the request be serviced? Blocking mixed versions is painful for users and TSEs trying to troubleshoot.
The semantics of the nodeID field changed between 22.2 and 23.1, and the startKey and endKey params were replaced with the spans field. Basically, the behavior of the endpoint and the wire formats changed enough between 22.2 and 23.1 to warrant breaking compatibility.
But there are no proto behavior changes in this PR - I was just moving some code around and figured I'd add a comment.
pkg/server/span_stats_server.go line 433 at r2 (raw file):
Previously, j82w (Jake) wrote…
What is the benefit of batching? Why not just send everything at once?
Resource consumption. If a SQL pod makes this request, we want to be able to attribute the cost of the request to the pod, while we keep the burden away from KV. KV servicing lots of small requests will consume less memory than one big request. At least in theory :D
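The batching pattern quoted above could be sketched as follows. This is a minimal, self-contained illustration only: `Span` is a simplified stand-in for `roachpb.Span`, and `batchSpans` is a hypothetical helper, not code from the PR. It shows the up-front ceiling-division allocation of the outer slice and the fact that each batch re-slices the input without copying:

```go
package main

import "fmt"

// Span is a simplified stand-in for roachpb.Span, which holds start/end keys.
type Span struct{ Start, End string }

// batchSpans splits spans into batches of at most batchSize. The outer slice
// is allocated once, sized with ceiling division; each batch is a sub-slice
// of the input, so no span data is copied.
func batchSpans(spans []Span, batchSize int) [][]Span {
	batches := make([][]Span, 0, (len(spans)+batchSize-1)/batchSize)
	for len(spans) > 0 {
		n := batchSize
		if len(spans) < n {
			n = len(spans)
		}
		batches = append(batches, spans[:n])
		spans = spans[n:]
	}
	return batches
}

func main() {
	// 7 spans with a batch size of 3 yields batches of sizes 3, 3, 1.
	for i, b := range batchSpans(make([]Span, 7), 3) {
		fmt.Printf("batch %d: %d spans\n", i, len(b))
	}
}
```

Each batch can then be dispatched as its own fan-out request, which is what bounds KV's per-request memory footprint.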
j82w
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @zachlite)
pkg/server/span_stats_server.go line 446 at r2 (raw file):
res.SpanToStats = make(map[string]*roachpb.SpanStats, totalSpans)
for _, batch := range batches {
Can this be optimized? The batches array allocation can be avoided by merging the logic above with this loop. A slice can be used to limit the spans being sent.
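A minimal sketch of the slice-based approach this comment describes, under stated assumptions: `Span` is a simplified stand-in for `roachpb.Span`, and `fn` is a hypothetical callback standing in for the per-batch fan-out call. The intermediate `[][]Span` never exists; each iteration just re-slices the input window:

```go
package main

import "fmt"

// Span is a simplified stand-in for roachpb.Span.
type Span struct{ Start, End string }

// processInBatches walks the spans window by window, avoiding the
// intermediate [][]Span allocation: each iteration re-slices the input
// (no copying) and hands one batch to fn.
func processInBatches(spans []Span, batchSize int, fn func([]Span)) {
	for start := 0; start < len(spans); start += batchSize {
		end := start + batchSize
		if end > len(spans) {
			end = len(spans)
		}
		fn(spans[start:end])
	}
}

func main() {
	// 10 spans with a batch size of 4 yields windows of sizes 4, 4, 2.
	processInBatches(make([]Span, 10), 4, func(b []Span) {
		fmt.Println("batch of", len(b), "spans")
	})
}
```

The trade-off is that the batches are consumed inline rather than materialized, so this only works if nothing downstream needs the full list of batches at once.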
Should the batching logic first split the request by node id, then by batch size? It would be bad for perf if the first 800 spans are for node 1, then the next 800 are for node 2, etc., because it would end up making multiple calls to node 2.

What is the reason the batch logic was not done in `s.sqlServer.tenantConnect.SpanStats`, which already has to split the request by node id?
Force-pushed from a7b8941 to 1aba190
zachlite
left a comment
Dismissed @j82w from a discussion.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @j82w)
pkg/server/span_stats_server.go line 454 at r2 (raw file):
Previously, j82w (Jake) wrote…
Should the batching logic first split the request by node id, then by batch size? It would be bad for perf if the first 800 spans are for node 1, then the next 800 are for node 2, etc., because it would end up making multiple calls to node 2.
Discussed offline. Batches need to happen before node ID lookups.
pkg/server/status.go line 3633 at r2 (raw file):
Previously, j82w (Jake) wrote…
What is the reason the batch logic was not done in `s.sqlServer.tenantConnect.SpanStats`, which already has to split the request by node id?
Discussed offline. Requests directly to the system tenant should be batched too.
j82w
left a comment
Reviewed 1 of 4 files at r1, 1 of 4 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @zachlite)
pkg/server/span_stats_server.go line 466 at r3 (raw file):
}
for k, v := range batchRes.SpanToStats {
Is there any chance the same key can be returned multiple times, once from the leaseholder and once from a non-leaseholder? Just want to confirm there doesn't need to be some form of merge logic instead of just replacing the value.
pkg/server/span_stats_server_test.go line 88 at r3 (raw file):
// Assert that the responses from each batch are returned to the caller.
require.Equal(t, tcase.nSpans, len(res.SpanToStats))
It would be good to validate the actual content is valid.
Previously, span stats requests could only request a maximum of 500 spans at a time; otherwise the request would return an error. This commit adds transparent batching at both the system tenant's handler and the secondary tenant's handler. The cluster setting `server.span_stats.span_batch_limit` is used to control the size of the batches. The default value has been raised to 1000.

Resolves cockroachdb#105638

Epic: CRDB-30635

Release note (ops change): Requests for database details or table details from the UI, or usages of `SHOW RANGES WITH DETAILS`, are no longer subject to errors if the number of requested spans is too large.
Force-pushed from 1aba190 to 4e14279
bors r+

Build succeeded:
Encountered an error creating backports. Some common things that can go wrong:

You might need to create your backport manually using the backport tool.

error setting reviewers, but backport branch blathers/backport-release-23.1-109464 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/109902/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. []

Backport to branch 23.1.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.