server: allow an arbitrary number of spans in a span stats request

**Is your feature request related to a problem? Please describe.**

As of #98490, a single `SpanStats` request can request stats for multiple spans via its `spans []roachpb.Span` field. However, the maximum number of spans allowed per request is controlled by the cluster setting `server.span_stats.span_batch_limit`, which defaults to 500. The original purpose of this limit was to ensure that the caller received a speedy response.


#103128 made `SpanStats` a dependency of `SHOW RANGES WITH DETAILS`, which immediately caused [roachtest failures](https://github.com/cockroachdb/cockroach/issues/105274) where the cluster under test had more ranges than the default limit of 500.

The cluster setting was bumped as a workaround in https://github.com/cockroachdb/cockroach/pull/105500, but the following problems remain:
 1. We can expect customers to hit this same error and then have to raise the cluster setting themselves, which is unnecessary friction.
 2. The roachtest failures are preventing #103128 from being backported in its current form.
 3. Even if a customer raises the setting, there's no batching, so a single large request could lead to increased rates of failed requests due to timeouts, failed transactions, etc.

----

**Describe the solution you'd like**

A `SpanStats` request should be able to service an arbitrary number of spans in a reasonable amount of time.
In the wild, the largest clusters can have 1e6 ranges or more, and we must be able to service these cases.

A possible implementation is proposed [here](https://github.com/cockroachdb/cockroach/pull/105317#issuecomment-1602201138):
 > ...The solution will likely involve using bounded batches to fetch the span stats and streaming the results to the client.
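
To make the "bounded batches plus streaming" idea concrete, here is a minimal Go sketch. It is not the implementation from the linked comment. `Span` and `SpanStats` are simplified stand-ins for the real `roachpb` types, and `chunkSpans`, `streamSpanStats`, `fetch`, and `emit` are hypothetical names chosen for illustration.

```go
package main

// Span is a simplified stand-in for roachpb.Span so the sketch compiles alone.
type Span struct {
	Key, EndKey string
}

// SpanStats is a hypothetical, trimmed-down per-span result.
type SpanStats struct {
	RangeCount int
}

// chunkSpans splits spans into consecutive batches of at most limit entries.
// A non-positive limit is treated as 1 so the loop always makes progress.
func chunkSpans(spans []Span, limit int) [][]Span {
	if limit <= 0 {
		limit = 1
	}
	var batches [][]Span
	for start := 0; start < len(spans); start += limit {
		end := start + limit
		if end > len(spans) {
			end = len(spans)
		}
		batches = append(batches, spans[start:end])
	}
	return batches
}

// streamSpanStats fetches one bounded batch at a time and passes each partial
// result to emit as soon as it is available. fetch is an assumed hook for
// whatever performs a single bounded fan-out; emit is an assumed hook for
// sending a partial response to the client.
func streamSpanStats(
	spans []Span,
	limit int,
	fetch func(batch []Span) (map[Span]SpanStats, error),
	emit func(partial map[Span]SpanStats) error,
) error {
	for _, batch := range chunkSpans(spans, limit) {
		res, err := fetch(batch)
		if err != nil {
			return err
		}
		if err := emit(res); err != nil {
			return err
		}
	}
	return nil
}
```

In this shape the caller can pass any number of spans, while each individual fan-out stays bounded by `limit`, which could plausibly be driven by `server.span_stats.span_batch_limit`.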

Challenges:
- Determining the appropriate batch size and concurrency levels.
- Span stats relies on RPC node fan-outs, which are slow.
- Span stats should be considered authoritative, and to achieve this, it relies on Meta2 scans to look up range descriptors. These transactions add overhead. Profiling might reveal low-hanging fruit to optimize.
- Furthermore, if `SpanStats` is authoritative, does that rule out the option to pre-compute and cache?
- Callers of `SpanStats` like `SHOW RANGES WITH DETAILS` do not stream results back to the caller, so a streaming solution for `SpanStats` that lowered the "time to first result" would not get results back to the _user_ any faster, because `SHOW RANGES` would still block.
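
On the batch size and concurrency question, one common way to cap in-flight work is a buffered channel used as a semaphore. The sketch below assumes that approach and is not taken from the CockroachDB codebase. `Span` is a simplified stand-in for `roachpb.Span`, and `runBatches`, `maxInFlight`, and `fetch` are hypothetical names.

```go
package main

import "sync"

// Span is a simplified stand-in for roachpb.Span.
type Span struct {
	Key, EndKey string
}

// runBatches calls fetch once per batch, with at most maxInFlight calls
// running at the same time, and returns the first error observed (if any).
// For simplicity it does not cancel batches that are already running and it
// keeps launching later batches after a failure.
func runBatches(batches [][]Span, maxInFlight int, fetch func([]Span) error) error {
	if maxInFlight <= 0 {
		maxInFlight = 1
	}
	sem := make(chan struct{}, maxInFlight) // counting semaphore
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		firstErr error
	)
	for _, b := range batches {
		sem <- struct{}{} // blocks while maxInFlight batches are running
		wg.Add(1)
		go func(b []Span) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			if err := fetch(b); err != nil {
				mu.Lock()
				if firstErr == nil {
					firstErr = err
				}
				mu.Unlock()
			}
		}(b)
	}
	wg.Wait()
	return firstErr
}
```

Both knobs (spans per batch and `maxInFlight`) would still need to be tuned empirically. The sketch only shows where they plug in.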


Jira issue: CRDB-29134
Epic: CRDB-30635

server: allow an arbitrary number of spans in a span stats request #105638
