kvprober: add probes for problem ranges #76081
jmcarp wants to merge 1 commit into cockroachdb:master
Conversation
The kvprober provides good coverage of issues that affect many ranges, but has a lower probability of detecting individual bad ranges. To improve coverage of the latter case, remember failed ranges from the existing prober and probe them more frequently in a second pair of probe loops. Resolves cockroachdb#74407. Release note: None
Don't be silly; looks lovely! Get some tests in place & we can get it merged.
joshimhoff left a comment
Not a full review, but a few comments here & there!
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dasrirez and @jmcarp)
pkg/kv/kvprober/kvprober.go, line 74 at r1 (raw file):
func newProblemTracker() *problemTracker { return &problemTracker{size: 3} }
I'd suggest introducing a cluster setting for the size of this thing!
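To make the suggestion concrete, here is a minimal Go sketch of a tracker whose size is injected rather than hard-coded at 3. The names `problemRanges`, `newProblemRanges`, and `add` are hypothetical stand-ins, not kvprober's actual types; a real implementation would read the size from a cluster setting.

```go
package main

import "fmt"

// problemRanges is a hypothetical fixed-capacity tracker of recently
// failed range IDs. The capacity comes from the caller (e.g. a cluster
// setting) instead of a hard-coded constant.
type problemRanges struct {
	size   int
	ranges []int64 // oldest first
}

func newProblemRanges(size int) *problemRanges {
	return &problemRanges{size: size}
}

// add records a failed range, evicting the oldest entry when full.
func (p *problemRanges) add(rangeID int64) {
	if len(p.ranges) == p.size {
		p.ranges = p.ranges[1:]
	}
	p.ranges = append(p.ranges, rangeID)
}

func main() {
	p := newProblemRanges(2)
	p.add(10)
	p.add(11)
	p.add(12) // evicts range 10
	fmt.Println(p.ranges) // [11 12]
}
```

With this shape, resizing the tracker at runtime is just a matter of re-reading the setting before eviction decisions.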
pkg/kv/kvprober/kvprober.go, line 419 at r1 (raw file):
func (p *Prober) readProbeImpl(ctx context.Context, ops proberOps, txns proberTxn, pl planner, l ProbeLoop) {
    if !readEnabled.Get(&p.settings.SV) {
I suggest independent kill switches for the "problem range" probe loops, in the form of cluster settings. Just a bit more abstraction needed, I guess!
I also suggest independent controls on probes per minute for the "problem range" probe loops. I expect we can set those lower, since the probes are being made across fewer ranges.
pkg/kv/kvprober/kvprober.go, line 524 at r1 (raw file):
}
d := timeutil.Since(start)
I wonder if we should do a little extra work to also put a range with slower latencies than usual in a "problem queue". Would add a bit of extra complexity, but I think it might be worth it, in particular since IIUC (I might not) there is nothing in KV that is very strong at proactively signaling about an available but slow range today, and there are various forms of per-range rate limiting and back-pressure that can lead a range to be slow but available. For example, I don't think @tbg's lovely circuit breaker is going to yell if we have a slow but available range.
I think the easiest way to do this would be yet another set of two probe loops. Probe a range from the high-latency probe loops if the latency of some recent probe is slower than the latency of some probe currently in the "problem latency queue".
The circuit breaker currently trips if a command is replicating for more than
Would it be too lazy to memorize the n slowest probes by range, then periodically probe one of them at random?
@joshimhoff, do you think this is enough for us, or would it still be helpful to add probes for problematically slow ranges?
I am not currently looped deeply into this topic and so am hesitant to give a strong opinion, but intuitively I do get the sense that we might be trying to do too much at once here. It seems that
@tbg, is your response to the idea of trying to probe slow but available ranges, as per @jmcarp's and my discussion above about latency? Or to the current state of the PR, which is focused on repeatedly probing "unavailable" ranges (more concretely, ranges that
The problem with this idea to me is then
If we measure errors over count of probes over an hour-long window, we don't get cluster-wide coverage at every moment in time, and that is certainly an issue! But still I think your retry idea, tho simple, would hurt the usability of the prober in multi-range outage scenarios too much, and such scenarios are fairly common (e.g. they happen when cloud infra breaks across a zone or similar).
Do you want to take a peek at the PR if you haven't already? Certainly this PR adds complexity to kvprober. I think you describe the main benefit of this PR well down below:
I also think for 22.2 KV & SRE should spend more time working together on KV alerting!
Let's make a decision here, @jmcarp & @tbg. I can see doing one of two things:
@tbg, see #76081 (comment) for a response to your suggestion to just retry failed probes. Though it is super simple at a code level, I don't particularly like that idea. I don't feel strongly between 1 & 2, in particular given that we can still alert on single range issues with existing
My hesitancy mostly stems from not being sure what exactly we want kvprober to detect. I don't think it reliably gives us the "% of keyspace down" numbers (since it doesn't fire off probes open-loop and also doesn't randomize that much), so I'm not sure it is the source of truth I would use for such numbers, and so I'm not sure the non-sticky prober is as useful as you think. But I'm not an ops person and maybe it is! We could talk about that more. Since I like the sticky probing, and that's what's being added here, assuming the code is ok I think we should add it. Then we can have a separate conversation about "breadth-first" probing (i.e. original kvprober) and see if we want to make changes to it.
Sounds good! In general, I look forward to the 22.2 cycle as I am hoping KV & SRE can work more closely on alerting & observability.
Ack! I agree. It's nice to have this PR even with circuit breaker, since
Sounds good! @jmcarp, want to get this one ready for a real review, if you agree with the decision here? Let's def not do anything about latency. That would add too much complexity, and we can talk more about it as part of 22.2 planning, at a higher level, and with KV team.
With a feasible-to-implement version of #71169 coming into view, I think we should close this for now!
I've come back around to some version of this idea being a good one. I'd really like it if we had high confidence in our ability to detect single range outages in CC, even if with a time delay given that all ranges are not probed at every second. And tho I'm excited about #71169, (i) we don't have it yet and (ii) it's sort of orthogonal to the idea of detecting single range outages at high confidence. @tbg suggested sticky probing:
I think I vote we do this but slightly tweaked. If a probe to some range fails, we probe it on every other probe (or similar). The other half of the time, we probe via the planner. Then we get data on the size of the outage (not perfect, but some data) and we also have confidence we will eventually catch single range outages (assuming not many CRDB restarts). I expect this won't add much complexity to the code? Maybe we should just code it up and compare it to this implementation.

One Q about this for Tobias: This means all nodes will eventually send a probe repeatedly to an unavailable range (assuming there is just one unavailable range). As the number of nodes grows, so will the probe rate on the unavailable range. Is this a problem? I guess one advantage of the implementation here is that the "problem range" probe rate is separate from the always-on planner-style probe rate, and so it can be set lower.

Also, CC @Santamaura for KV observability perspective.
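The alternating scheme proposed here (retry the failed range on every other probe, otherwise defer to the planner) could look roughly like this in Go; `stickyPlanner` and its methods are hypothetical names, and `plan` stands in for kvprober's real planner:

```go
package main

import "fmt"

// stickyPlanner alternates between re-probing a remembered failed
// range and delegating to the regular planner, so single-range outages
// are eventually caught while cluster-wide coverage data keeps flowing.
type stickyPlanner struct {
	failed int64 // 0 means no remembered failure
	retry  bool  // when true, the next probe retries the failed range
	plan   func() int64
}

// next returns the range ID to probe this tick.
func (s *stickyPlanner) next() int64 {
	if s.failed != 0 && s.retry {
		s.retry = false
		return s.failed
	}
	s.retry = true
	return s.plan()
}

// report records a probe outcome; a success clears the remembered range.
func (s *stickyPlanner) report(rangeID int64, ok bool) {
	if !ok {
		s.failed = rangeID
		s.retry = true
	} else if rangeID == s.failed {
		s.failed = 0
	}
}

func main() {
	n := int64(100)
	s := &stickyPlanner{plan: func() int64 { n++; return n }}
	s.report(7, false)              // range 7 failed a probe
	fmt.Println(s.next(), s.next()) // 7 101: failed range, then planner
}
```

Joshimhoff's concern still applies unchanged: every node running this logic will converge on the same unavailable range, so the aggregate retry rate grows with cluster size.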
Apologies for the delay in getting back. I have difficulty providing a snappy answer because I'm uncertain if we're taking the right approach (the prober was simple; now it gets more complicated; are we just backtracking on its fundamental design?) and because other commitments that I consider more urgent require a bit of time management on my end. I'll discuss next steps with @mwang1026 and @shralex tomorrow.
No rush, and tell me if there are things I can do to reduce your workload when helping me out! I'm also uncertain if we are taking the right approach. I'm not uncertain that we should be able to detect single range hard-down outages at high confidence. I'm also not really uncertain that probing at a high level is a good technique for this. The problem isn't that a probe won't catch single range outages like the 6sense outage [1]; it is that we aren't sending probes in a way that leads to an alert given the outage type (single range outage).

[1] We did catch this outage. But we caught it with sqlprober, not kvprober. If it weren't a system range, we wouldn't have caught it.
Maybe we want a "quarantine" system. If you fail a probe, you get quarantined (if there's room, otherwise ignore quarantine). Ranges that are in quarantine get probed separately (to avoid slowing down the main prober) at an interval. We track the time of entrance into the quarantine pool and report the duration the oldest member has been there as a metric. Then you can alert on that (instead of on any metric about write failures), without a temporary blip messing up the alerting.

With that setup, we don't have to rely on failing probes to determine whether there is an outage that's long enough to page someone. Instead, we get to look directly at the quantity we care about, namely how long it's been since we know of the outage. It would work like an exact mirror image of the "uptime" metric.
I like that idea, @tbg! I'll try to find some time to give it a shot soon. |
Note: this is a quick sketch to check whether I'm thinking about the problem correctly. The code is probably not great! I'll clean this up and add tests after getting some feedback on the general approach.
The kvprober provides good coverage of issues that affect many ranges,
but has a lower probability of detecting individual bad ranges. To
improve coverage of the latter case, remember failed ranges from the
existing prober and probe them more frequently in a second pair of probe
loops.
Resolves #74407.
Release note: None