kvprober: allow a manual probe of the entire keyspace #61695

@tbg

Is your feature request related to a problem? Please describe.
While the kvprober probes "random" KV ranges at intervals, during an incident one also wants to conclusively establish the health of the KV layer by proactively probing the entire keyspace as quickly as possible. Either specific ranges fail the probe (indicating an issue at the KV layer or below), or none do (indicating a failure above the KV layer, or one that the probe does not catch).

Describe the solution you'd like

I think it would make sense to run something like

select crdb_internal.probe_range(start_key) from crdb_internal.ranges_no_leases;
  range_id | pass | details
-----------+------+----------
        1  | true | {"read_ns": 150031, "write_ns": ...}
[...]

So the task at hand becomes implementing crdb_internal.probe_range.
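
To make the intended triage concrete, here is a minimal client-side sketch of the same idea in Python: probe every range once, record pass/fail per range, and list the ranges that failed. It does not reflect how the builtin would be implemented. The names `ProbeResult`, `probe_keyspace`, and `failing_ranges` are hypothetical, and the `probe` callable stands in for whatever per-range read/write probe ends up existing; it is assumed to raise an exception on failure.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class ProbeResult:
    range_id: int
    passed: bool
    details: dict


def probe_keyspace(
    range_ids: Iterable[int],
    probe: Callable[[int], None],  # hypothetical per-range probe; raises on failure
    max_workers: int = 8,
) -> List[ProbeResult]:
    """Probe every range once, in parallel, and return results sorted by range_id."""

    def run(range_id: int) -> ProbeResult:
        start = time.monotonic_ns()
        try:
            probe(range_id)
        except Exception as err:
            # A failed probe points at this specific range.
            return ProbeResult(range_id, False, {"error": str(err)})
        elapsed = time.monotonic_ns() - start
        return ProbeResult(range_id, True, {"elapsed_ns": elapsed})

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run, range_ids))
    return sorted(results, key=lambda r: r.range_id)


def failing_ranges(results: List[ProbeResult]) -> List[int]:
    """Return the IDs of ranges whose probe failed; empty means no range failed."""
    return [r.range_id for r in results if not r.passed]
```

An empty list from `failing_ranges` corresponds to the "failure above the KV layer, or not caught by the probe" outcome; a non-empty list names the ranges to investigate first.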

Describe alternatives you've considered

One could likely come up with many alternative designs.

Additional context

Currently, in practice, I believe we (or at least I) ascertain KV health by searching the logs for messages of the form "have been waiting X for Y", which are emitted on slow replication, slow latching, slow DistSender RPCs, and a few other conditions. Once probes become suitably powerful, they should be able to catch most of these. These log messages are also hooked up to the slow.* family of gauges, which, however, do not reveal which ranges have the issue.

Jira issue: CRDB-2743

Labels

A-kv: Anything in KV that doesn't belong in a more specific category.
C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception).
O-sre: For issues SRE opened or otherwise cares about tracking.
