kvprober: allow a manual probe of the entire keyspace #61695
Description
Is your feature request related to a problem? Please describe.
The kvprober probes "random" KV ranges at intervals, but during an incident one also wants to conclusively establish the health of the KV layer by proactively probing the entire keyspace as quickly as possible. Such a probe would either surface specific ranges that fail (indicating an issue at the KV layer or below) or find none (indicating a failure above the KV layer, or one the probe does not catch).
Describe the solution you'd like
I think it would make sense to run something like
select crdb_internal.probe_range(start_key) from crdb_internal.ranges_no_leases;
range_id | pass | details
-----------+------+----------
1 | true | {"read_ns": 150031, "write_ns": ...}
[...]
So the task at hand becomes implementing crdb_internal.probe_range.
Describe alternatives you've considered
One could likely come up with many alternative designs.
Additional context
Currently, in practice, I believe we (or at least I) ascertain KV health by looking through the logs for messages of the form "have been waiting X for Y", which are emitted on slow replication, slow latching, slow DistSender RPCs, and a few other conditions. Once probes become suitably powerful, they should be able to catch most of these. These log messages are also hooked up to the slow.* family of gauges, but those gauges don't tell you which ranges have the issue.
Jira issue: CRDB-2743