server: allow an arbitrary number of spans in a span stats request #105638

@zachlite

Description

Is your feature request related to a problem? Please describe.

As of #98490, a single SpanStats request can ask for stats on multiple spans via its spans []roachpb.Span field. However, the maximum number of spans allowed per request is controlled by the cluster setting server.span_stats.span_batch_limit, which has a default value of 500. The original purpose of server.span_stats.span_batch_limit was to ensure that the caller received a speedy response.

#103128 made SpanStats a dependency of SHOW RANGES WITH DETAILS, which immediately caused roachtest failures where the cluster under test had more ranges than the default 500.

The cluster setting was bumped as a workaround in #105500, but the following realities still exist:

  1. We can expect customers to hit this same error and to have to bump the cluster setting themselves, which adds friction.
  2. The roachtest failures are preventing #103128 (sql: extend SHOW RANGES to include object size estimates) from being backported in its current form.
  3. Even if a customer bumps the setting, there's no batching. This could lead to increased rates of failed requests due to timeouts, failed transactions, etc.

Describe the solution you'd like

A SpanStats request should be able to service an arbitrary number of spans in a reasonable amount of time.
In the wild, the largest clusters can have 1e6 ranges or more, and we must be able to service these cases.

A possible implementation is proposed here:

...The solution will likely involve using bounded batches to fetch the span stats and streaming the results to the client.
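
To make the bounded-batch idea concrete, here is a minimal sketch in Go. It is not the proposed implementation: `Span`, `Stats`, `FetchFn`, `chunk`, and `fetchAll` are all hypothetical names standing in for the real types and RPCs, and the batch size and concurrency values are left to the caller because choosing them is one of the open challenges. The sketch only shows splitting the spans into batches no larger than a limit and keeping a bounded number of batches in flight. It collects results instead of streaming them.

```go
package main

import "sync"

// Span is a hypothetical stand-in for roachpb.Span.
type Span struct {
	Key, EndKey string
}

// Stats is a hypothetical stand-in for a per-span stats result.
type Stats struct {
	RangeCount int
}

// FetchFn is a hypothetical callback that fetches stats for one batch of
// spans, for example by issuing a single SpanStats request.
type FetchFn func(batch []Span) ([]Stats, error)

// chunk splits spans into consecutive batches of at most size elements.
func chunk(spans []Span, size int) [][]Span {
	if size <= 0 {
		size = 1
	}
	var batches [][]Span
	for start := 0; start < len(spans); start += size {
		end := start + size
		if end > len(spans) {
			end = len(spans)
		}
		batches = append(batches, spans[start:end])
	}
	return batches
}

// fetchAll runs fetch over bounded batches with at most maxInFlight batches
// outstanding at once, and returns the results in input order.
func fetchAll(spans []Span, batchSize, maxInFlight int, fetch FetchFn) ([]Stats, error) {
	if maxInFlight <= 0 {
		maxInFlight = 1
	}
	batches := chunk(spans, batchSize)
	results := make([][]Stats, len(batches))
	errs := make([]error, len(batches))
	sem := make(chan struct{}, maxInFlight)
	var wg sync.WaitGroup
	for i, b := range batches {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxInFlight batches are outstanding
		go func(i int, b []Span) {
			defer wg.Done()
			defer func() { <-sem }()
			// Each goroutine writes only to its own index, so no lock is needed.
			results[i], errs[i] = fetch(b)
		}(i, b)
	}
	wg.Wait()
	out := make([]Stats, 0, len(spans))
	for i := range batches {
		if errs[i] != nil {
			return nil, errs[i]
		}
		out = append(out, results[i]...)
	}
	return out, nil
}
```

With a batch size of 500, a request for 1e6 spans would become 2,000 batches. A streaming variant could send each batch's results on a channel as it completes rather than appending to `out`.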

Challenges:

  • Determining the appropriate batch size and concurrency levels.
  • Span stats relies on RPC node fan-outs, which are slow.
  • Span stats should be considered authoritative, and to achieve this, it relies on Meta2 scans to look up range descriptors. These transactions add overhead. Perhaps there's profiling that would reveal low-hanging fruit to optimize.
  • Furthermore, if SpanStats is authoritative, does that rule out the option to pre-compute and cache?
  • Callers of SpanStats like SHOW RANGES WITH DETAILS do not stream results back to the caller, so a streaming solution for SpanStats that lowered the "time to first result" would not get results back to the user any faster, because SHOW RANGES would still block.

Jira issue: CRDB-29134
Epic: CRDB-30635

Metadata

Labels

C-enhancement, branch-master, release-blocker, v23.1.12
