kvserver,server,sql: provide efficient mechanism to retrieve data size information for span #84105
Description
Jira: CRDB-17463
Epic: CRDB-22711
Is your feature request related to a problem? Please describe.
SQL observability desires access to fine-grained size information for tables and, probably before too long, even indexes and index partitions. This shows up in a few places. One is crdb_internal.ranges, where we request some data on a per-range basis. Another is the admin server's TableStats and DatabaseStats. There are two primary mechanisms by which this information has been retrieved in the past, both of which have problems:
- serverpb.SpanStatsRequest -- used to fetch table stats.
  - This is a per-node request which aggregates the MVCC stats and approximate disk bytes for a given key span relatively efficiently, if the number of ranges is large relative to the number of nodes. Conversely, because the request is sent to every node, it may be more expensive if the number of nodes is large relative to the number of ranges.
  - The approximate disk bytes out of pebble are particularly valuable because they are the only way to get a sense of physical byte usage, and they can be safely aggregated to produce a meaningful value. That said, while disk usage is a good thing to know, it's definitely more complicated to explain and reason about than logical bytes.
  - One can sort of reason about aggregating the MVCC stats too, but one would need to divide properly by the replication factor.
- roachpb.RangeStatsRequest -- used inside some of the virtual tables.
  - This is a per-range request, which means it is likely to be more expensive if the number of nodes is small relative to the number of ranges.
  - This request is generally sent sequentially, which means it can be very slow.
  - This request is non-transactional and only needs to go to a local replica, so if there is a replica for every relevant range in the local region, it can be fast.
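To illustrate the replication-factor caveat above, here is a minimal Go sketch. The MVCCStats type here is a hypothetical, simplified stand-in (a single bytes field) for the real enginepb.MVCCStats, and the helper is illustrative only, not a proposed API:

```go
package main

import "fmt"

// MVCCStats is a simplified stand-in for enginepb.MVCCStats
// (hypothetical subset: just live bytes).
type MVCCStats struct {
	LiveBytes int64
}

// estimateLogicalBytes divides replica-aggregated stats by the
// replication factor to approximate logical data size. It assumes a
// uniform replication factor across the span, which need not hold in
// practice (zone configs can vary per range).
func estimateLogicalBytes(aggregated MVCCStats, replicationFactor int64) int64 {
	if replicationFactor <= 0 {
		return 0
	}
	return aggregated.LiveBytes / replicationFactor
}

func main() {
	// Three replicas of a 300-byte table report 900 aggregated bytes.
	fmt.Println(estimateLogicalBytes(MVCCStats{LiveBytes: 900}, 3))
}
```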
An issue with both of these requests is that neither provides a mechanism to compute MVCC statistics for a span smaller than a range; both provide only the MVCC stats for all of the ranges which intersect the span in question. In multi-tenant cockroach, this poses a serious problem today because we pack tables into a single range in that setting. We would like to pack tables together in "regular" cockroach as well (#81008), which would break the existing mechanisms. It will also get in the way of collecting information for indexes, which are already colocated inside a span.
Another relevant piece of context is that in many cases which care about data size, we also care about placement in terms of regions of the replicas and of the leaseholder. The RangeStatsResponse carries RangeInfo which has the lease information and the descriptor.
Up to this point, the requests mentioned above only look at in-memory data structures; no scanning of data is required to produce responses. This is nice from an efficiency perspective, but it seems untenable going forward. We'll need to provide tools to compute sub-range MVCC statistics.
Describe the solution you'd like
As a first pass, keeping the client API range-oriented does not seem like a huge problem -- we do not have evidence that the number of RPCs causes big problems. It's also the case that if a span intersects a great many ranges, most of those ranges will be fully contained by the span, so today's in-memory behavior continues to apply. A proposal, which may be a weak one, would be to introduce a new request which sits in between roachpb.RangeStatsRequest and serverpb.SpanStatsRequest: roachpb.SpanStatsRequest. This request would leverage DistSender parallelism to perform its work. When the relevant portion of the span does not fully cover a range, the actual MVCC stats for that portion would be computed by scanning the data; this should happen for at most two ranges (the ones at the boundaries of the span). The response can contain placement and leaseholder information for each range the span intersects. This information could be aggregated when merging the responses or presented on a per-range basis.
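A minimal sketch of the per-range decision the proposal describes -- in-memory stats when the request span fully contains the range, a scan otherwise. The types here (string keys, a bytes-only stat, a map standing in for on-disk data) are hypothetical simplifications, not the real roachpb or DistSender machinery:

```go
package main

import "fmt"

// Span is a simplified key span [Key, EndKey).
type Span struct{ Key, EndKey string }

// Range pairs a descriptor-like span with its in-memory stats.
type Range struct {
	Span       Span
	StatsBytes int64            // maintained in memory, cheap to read
	Data       map[string]int64 // key -> value size; scanned only on partial overlap
}

// contains reports whether outer fully contains inner.
func contains(outer, inner Span) bool {
	return outer.Key <= inner.Key && inner.EndKey <= outer.EndKey
}

// spanStats returns the stats for the portion of rng covered by req.
// Fully contained ranges take the cheap in-memory path; partially
// covered ranges (at most the two at the span's boundaries) fall back
// to scanning the data.
func spanStats(req Span, rng Range) int64 {
	if contains(req, rng.Span) {
		return rng.StatsBytes // fast path: no scan needed
	}
	var total int64
	for k, sz := range rng.Data {
		if req.Key <= k && k < req.EndKey {
			total += sz
		}
	}
	return total
}

func main() {
	partial := Range{
		Span: Span{"a", "e"},
		Data: map[string]int64{"a": 1, "b": 2, "c": 3, "d": 4},
	}
	// Request covers only [b, d) of the range, so we scan: 2 + 3 = 5.
	fmt.Println(spanStats(Span{"b", "d"}, partial))
	// Request fully contains the range, so the in-memory stat is used.
	fmt.Println(spanStats(Span{"a", "z"}, Range{Span: Span{"b", "c"}, StatsBytes: 42}))
}
```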
Describe alternatives you've considered
One alternative would be to stick to a store-oriented API.
Additional context
Reducing RPCs
Ideally, we'd only need to send a number of RPCs proportional to the number of KV nodes containing data that intersects the span[s] in question. An open question is whether O(ranges) RPCs are actively a problem, given the DistSender's ability to parallelize and the fact that there likely needs to be independent locking (#34999). We may want to allow multiple spans in the same request, but that decision is out of scope for this initial issue statement. We should likely push the problem of fairness and cost of parallelism into the DistSender, to coordinate with admission control.
Aggregating internally
One observation is that in many of the higher-level use cases for these requests, we don't really care about individual ranges; we just care about spans and their aggregate statistics. This is particularly true in #84090. Indeed, there may be value in obfuscating the individual ranges and nodes from the client.
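A sketch of that aggregation step, folding per-range fragments into span-level totals so callers never see individual ranges or nodes. The types are hypothetical stand-ins for the real response protos:

```go
package main

import "fmt"

// RangeStats is a per-range response fragment (hypothetical shape).
type RangeStats struct {
	RangeID    int64
	LiveBytes  int64
	ApproxDisk int64
}

// SpanTotals is what a higher-level caller actually wants: aggregate
// numbers, with the individual ranges and nodes obfuscated.
type SpanTotals struct {
	RangeCount int
	LiveBytes  int64
	ApproxDisk int64
}

// aggregate merges per-range fragments into span-level totals. Both
// live bytes and approximate disk bytes sum safely across ranges.
func aggregate(parts []RangeStats) SpanTotals {
	var out SpanTotals
	for _, p := range parts {
		out.RangeCount++
		out.LiveBytes += p.LiveBytes
		out.ApproxDisk += p.ApproxDisk
	}
	return out
}

func main() {
	totals := aggregate([]RangeStats{
		{RangeID: 17, LiveBytes: 100, ApproxDisk: 120},
		{RangeID: 42, LiveBytes: 50, ApproxDisk: 70},
	})
	fmt.Println(totals.RangeCount, totals.LiveBytes, totals.ApproxDisk)
}
```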