kvprober: avoid repeated planning failures causing CRDB issues

**Is your feature request related to a problem? Please describe.**
kvprober scans `kv.prober.planner.num_steps_to_plan_at_once` rows worth of meta2 & unmarshalls to proto when it doesn't know what range to probe next, in order to construct a new probing plan: https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvprober/planner.go#L155. This implies that so long as `kv.prober.planner.num_steps_to_plan_at_once` is >1, scanning meta2 & unmarshalling isn't needed on every prober tick (which happens every `kv.prober.read.interval` seconds), and in fact, if `kv.prober.planner.num_steps_to_plan_at_once` is set to a large number, doing planning on every prober tick could spend more resources they we want.

One thing I'd like to avoid is repeated plan failures either causing a production outage or else making an existing production outage worse. One way I can imagine this kind of thing happening is overload / contention on meta2: Repeated plan failures (whether due to code bug in prober or non-prober related production outage) will lead to scanning `kv.prober.planner.num_steps_to_plan_at_once` rows worth of meta2 & unmarshalling **every** `kv.prober.read.interval` seconds. Depending on how the relevant cluster settings are set, could this introduce new problems?

Possibly the answer to this is it's fine. Possibly so long as `kv.prober.planner.num_steps_to_plan_at_once` is not set very very high, the worst case scenario of scanning `kv.prober.planner.num_steps_to_plan_at_once` rows worth of meta2 every `kv.prober.read.interval` seconds is acceptable. It's hard for me to judge, given my inexperience writing CRDB code, but it does make me nervous.

@tbg and @knz, curious to hear how you think about this, but PLEASE feel free to not think about this until after 21.2 is out / things are less busy!

**Describe the solution you'd like**
We can consider a few things:

1. Do an exponential backoff.
2. Allow setting a max frequency for scanning meta2 & unmarshalling. If can't plan and have no next `Step`, just increment a metric.

I like 2 better than 1. It sounds straight forward to implement and it limits max resource usage in the way we want. I think I'll give implementing it a shot now.

**Additional context**
https://github.com/cockroachdb/cockroach/issues/61074

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kvprober: avoid repeated planning failures causing CRDB issues #61255

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

kvprober: avoid repeated planning failures causing CRDB issues #61255

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions