kvprober: find single-range issues by repeatedly probing a candidate problem range after randomly finding it #74407
Description
Is your feature request related to a problem? Please describe.
kvprober probes all ranges. Single-range issues happen. kvprober will detect such an issue, but the resulting overall error rate will be extremely low (roughly 1 / number of ranges in the cluster), which makes alerting on it hard.
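To make the dilution concrete, here is a small sketch of the arithmetic. It assumes probes are spread uniformly at random across ranges and that every probe of a bad range fails; `ExpectedErrorRate` and the numbers in the usage note are hypothetical and are not part of kvprober.

```go
package main

// ExpectedErrorRate returns the fraction of probes expected to fail when
// badRanges out of totalRanges are unhealthy. It assumes uniform random
// probing and that every probe of a bad range fails (both are assumptions
// made for illustration only).
func ExpectedErrorRate(badRanges, totalRanges int) float64 {
	if totalRanges <= 0 {
		return 0
	}
	return float64(badRanges) / float64(totalRanges)
}
```

For example, with one bad range in a hypothetical cluster of 50,000 ranges, `ExpectedErrorRate(1, 50000)` is 1/50000, or 0.002% of probes. An alert threshold on the overall error rate would have to sit below that to fire.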
Describe the solution you'd like
kvprober could "remember" when it probes a range and doesn't get back a successful (or fast) response. kvprober could then probe that range regularly, in a separate goroutine from the one in which it is probing all ranges. kvprober could generate metrics on the error rate & latency profile of the candidate problem range. SRE could alert on this. Basically, when kvprober discovers a candidate problem range, it focuses on producing data about that range.
Describe alternatives you've considered
An alternative is to not do this, and instead write a long time-window, log-based alert that fires on multiple errors in a row for RPCs to a specific range. This might be workable, though the time to page would be much longer than with the approach above. It would also need #74405.
Additional context
N/A
@tbg & @andreimatei: Wdyt about this idea?
Jira issue: CRDB-12069