
kvprober: find single range issues by repeatedly probing problem ranges with issues after randomly finding a candidate problem range #74407

@joshimhoff

Is your feature request related to a problem? Please describe.
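kvprober probes all ranges in the cluster. Issues that affect only a single range do happen. kvprober will detect such an issue, but the resulting error rate will be extremely low (roughly 1 / number of ranges in the cluster), because only probes that happen to land on the affected range fail. This makes alerting on such an issue hard.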
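To make the scale concrete, here is a rough back-of-the-envelope calculation in Go. It assumes probes are spread uniformly over all ranges and that the affected range fails every probe; both are simplifications, and the cluster sizes and the function name `singleRangeErrorRate` are purely illustrative.

```go
package main

import "fmt"

// singleRangeErrorRate returns the fraction of probes expected to fail when
// probes are spread uniformly over totalRanges and brokenRanges of them fail
// every probe. Uniform spread and always-failing ranges are assumptions made
// for illustration only.
func singleRangeErrorRate(brokenRanges, totalRanges int) float64 {
	if totalRanges <= 0 {
		return 0
	}
	return float64(brokenRanges) / float64(totalRanges)
}

func main() {
	// Illustrative cluster sizes, not measurements from a real cluster.
	for _, n := range []int{1000, 100000} {
		rate := singleRangeErrorRate(1, n)
		fmt.Printf("ranges=%d expected error rate=%.4f%%\n", n, 100*rate)
	}
}
```

The larger the cluster, the smaller the signal, so a threshold-based alert on the cluster-wide error rate is unlikely to fire for one bad range.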

Describe the solution you'd like
kvprober could "remember" when it probes a range and doesn't get back a successful (or fast) response. kvprober could then probe that range regularly, in a separate goroutine from the one in which it is probing all ranges. kvprober could generate metrics on the error rate & latency profile of the candidate problem range. SRE could alert on this. Basically, when kvprober discovers a candidate problem range, it focuses on producing data about that range.

Describe alternatives you've considered
An alternative is to not do this, and instead write a log-based alert with a long time window that fires on multiple errors in a row for RPCs to a specific range. This might be workable, though the time to page would be much longer than with the approach above. It would also need #74405.

Additional context
N/A

@tbg & @andreimatei: Wdyt about this idea?

Jira issue: CRDB-12069

Labels

A-kv-observability
C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
O-sre: For issues SRE opened or otherwise cares about tracking.
