Is your feature request related to a problem? Please describe.
kvprober scans kv.prober.planner.num_steps_to_plan_at_once rows worth of meta2 & unmarshalls to proto when it doesn't know what range to probe next, in order to construct a new probing plan: https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvprober/planner.go#L155. This implies that so long as kv.prober.planner.num_steps_to_plan_at_once is >1, scanning meta2 & unmarshalling isn't needed on every prober tick (which happens every kv.prober.read.interval seconds), and in fact, if kv.prober.planner.num_steps_to_plan_at_once is set to a large number, doing planning on every prober tick could spend more resources they we want.
One thing I'd like to avoid is repeated plan failures either causing a production outage or else making an existing production outage worse. One way I can imagine this kind of thing happening is overload / contention on meta2: Repeated plan failures (whether due to code bug in prober or non-prober related production outage) will lead to scanning kv.prober.planner.num_steps_to_plan_at_once rows worth of meta2 & unmarshalling every kv.prober.read.interval seconds. Depending on how the relevant cluster settings are set, could this introduce new problems?
Possibly the answer to this is it's fine. Possibly so long as kv.prober.planner.num_steps_to_plan_at_once is not set very very high, the worst case scenario of scanning kv.prober.planner.num_steps_to_plan_at_once rows worth of meta2 every kv.prober.read.interval seconds is acceptable. It's hard for me to judge, given my inexperience writing CRDB code, but it does make me nervous.
@tbg and @knz, curious to hear how you think about this, but PLEASE feel free to not think about this until after 21.2 is out / things are less busy!
Describe the solution you'd like
We can consider a few things:
- Do an exponential backoff.
- Allow setting a max frequency for scanning meta2 & unmarshalling. If can't plan and have no next
Step, just increment a metric.
I like 2 better than 1. It sounds straight forward to implement and it limits max resource usage in the way we want. I think I'll give implementing it a shot now.
Additional context
#61074
Is your feature request related to a problem? Please describe.
kvprober scans
kv.prober.planner.num_steps_to_plan_at_oncerows worth of meta2 & unmarshalls to proto when it doesn't know what range to probe next, in order to construct a new probing plan: https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvprober/planner.go#L155. This implies that so long askv.prober.planner.num_steps_to_plan_at_onceis >1, scanning meta2 & unmarshalling isn't needed on every prober tick (which happens everykv.prober.read.intervalseconds), and in fact, ifkv.prober.planner.num_steps_to_plan_at_onceis set to a large number, doing planning on every prober tick could spend more resources they we want.One thing I'd like to avoid is repeated plan failures either causing a production outage or else making an existing production outage worse. One way I can imagine this kind of thing happening is overload / contention on meta2: Repeated plan failures (whether due to code bug in prober or non-prober related production outage) will lead to scanning
kv.prober.planner.num_steps_to_plan_at_oncerows worth of meta2 & unmarshalling everykv.prober.read.intervalseconds. Depending on how the relevant cluster settings are set, could this introduce new problems?Possibly the answer to this is it's fine. Possibly so long as
kv.prober.planner.num_steps_to_plan_at_onceis not set very very high, the worst case scenario of scanningkv.prober.planner.num_steps_to_plan_at_oncerows worth of meta2 everykv.prober.read.intervalseconds is acceptable. It's hard for me to judge, given my inexperience writing CRDB code, but it does make me nervous.@tbg and @knz, curious to hear how you think about this, but PLEASE feel free to not think about this until after 21.2 is out / things are less busy!
Describe the solution you'd like
We can consider a few things:
Step, just increment a metric.I like 2 better than 1. It sounds straight forward to implement and it limits max resource usage in the way we want. I think I'll give implementing it a shot now.
Additional context
#61074