kvprober: add probes for problem ranges #76081
jmcarp wants to merge 1 commit into cockroachdb:master
Conversation
The kvprober provides good coverage of issues that affect many ranges, but has a lower probability of detecting individual bad ranges. To improve coverage of the latter case, remember failed ranges from the existing prober and probe them more frequently in a second pair of probe loops. Resolves cockroachdb#74407. Release note: None
Don't be silly; looks lovely! Get some tests in place & we can get it merged.
joshimhoff left a comment
Not a full review, but a few comments here & there!
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dasrirez and @jmcarp)
pkg/kv/kvprober/kvprober.go, line 74 at r1 (raw file):
func newProblemTracker() *problemTracker { return &problemTracker{size: 3} }
I'd suggest introducing a cluster setting for the size of this thing!
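To make the suggestion concrete, here is a minimal Go sketch of a tracker whose size is injected rather than hard-coded at 3. The names `problemRanges`, `newProblemRanges`, and `add` are hypothetical stand-ins, not kvprober's actual types; a real implementation would read the size from a cluster setting.

```go
package main

import "fmt"

// problemRanges is a hypothetical fixed-capacity tracker of recently
// failed range IDs. The capacity comes from the caller (e.g. a cluster
// setting) instead of a hard-coded constant.
type problemRanges struct {
	size   int
	ranges []int64 // oldest first
}

func newProblemRanges(size int) *problemRanges {
	return &problemRanges{size: size}
}

// add records a failed range, evicting the oldest entry when full.
func (p *problemRanges) add(rangeID int64) {
	if len(p.ranges) == p.size {
		p.ranges = p.ranges[1:]
	}
	p.ranges = append(p.ranges, rangeID)
}

func main() {
	p := newProblemRanges(2)
	p.add(10)
	p.add(11)
	p.add(12) // evicts range 10
	fmt.Println(p.ranges) // [11 12]
}
```

With this shape, resizing the tracker at runtime is just a matter of re-reading the setting before eviction decisions.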
pkg/kv/kvprober/kvprober.go, line 419 at r1 (raw file):
func (p *Prober) readProbeImpl(ctx context.Context, ops proberOps, txns proberTxn, pl planner, l ProbeLoop) {
    if !readEnabled.Get(&p.settings.SV) {
I suggest independent kill switches for the "problem range" probe loops, in the form of cluster settings. Just a bit more abstraction needed, I guess!
I also suggest independent controls on probes per minute for the "problem range" probe loops. I expect we can set those lower, since the probes are being made across fewer ranges.
pkg/kv/kvprober/kvprober.go, line 524 at r1 (raw file):
}
d := timeutil.Since(start)
I wonder if we should do a little extra work to also put a range with slower latencies than usual in a "problem queue". Would add a bit of extra complexity, but I think it might be worth it, in particular since IIUC (I might not) there is nothing in KV that is very strong at proactively signaling about an available but slow range today, and there are various forms of per-range rate limiting and back-pressure that can lead a range to be slow but available. For example, I don't think @tbg's lovely circuit breaker is going to yell if we have a slow but available range.
I think the easiest way to do this would be yet another set of two probe loops. Probe a range from the high-latency probe loops if the latency of some recent probe is slower than the latency of some probe currently in the "problem latency queue".
The circuit breaker currently trips if a command is replicating for more than
Would it be too lazy to memorize the n slowest probes by range, then periodically probe one of them at random?
@joshimhoff, do you think this is enough for us, or would it still be helpful to add probes for problematically slow ranges?
I am not currently looped deeply into this topic and so am hesitant to give a strong opinion, but intuitively I do get the sense that we might be trying to do too much at once here. It seems that
@tbg, is your response to the idea of trying to probe slow but available ranges, as per @jmcarp's and my discussion above about latency? Or to the current state of the PR, which is focused on repeatedly probing "unavailable" ranges (more concretely, ranges that
The problem with this idea to me is then
If we measure errors over count of probes over an hour-long window, we don't get cluster-wide coverage at every moment in time, and that is certainly an issue! But still I think your retry idea, tho simple, would hurt the usability of the prober in multi-range outage scenarios too much, and such scenarios are fairly common (e.g. they happen when cloud infra breaks across a zone or similar).
Do you want to take a peek at the PR if you haven't already? Certainly this PR adds complexity to kvprober. I think you describe the main benefit of this PR well down below:
I also think for 22.2 KV & SRE should spend more time working together on KV alerting!
Let's make a decision here, @jmcarp & @tbg. I can see doing one of two things:
@tbg, see #76081 (comment) for a response to your suggestion to just retry failed probes. Though it is super simple at a code level, I don't particularly like that idea. I don't feel strongly between 1 & 2, in particular given that we can still alert on single range issues with existing
My hesitancy mostly stems from not being sure what exactly we want kvprober to detect. I don't think it reliably gives us the "% of keyspace down" numbers (since it doesn't fire off probes open-loop and also doesn't randomize that much), so I'm not sure it is the source of truth I would use for such numbers, and so I'm not sure the non-sticky prober is as useful as you think. But I'm not an ops person and maybe it is! We could talk about that more. Since I like the sticky probing, and that's what's being added here, assuming the code is ok I think we should add it. Then we can have a separate conversation about "breadth-first" probing (i.e. original kvprober) and see if we want to make changes to it.
Sounds good! In general, I look forward to the 22.2 cycle as I am hoping KV & SRE can work more closely on alerting & observability.
Ack! I agree. It's nice to have this PR even with circuit breaker, since
Sounds good! @jmcarp, want to get this one ready for a real review, if you agree with the decision here? Let's def not do anything about latency. That would add too much complexity, and we can talk more about it as part of 22.2 planning, at a higher level, and with KV team.
With a feasible-to-implement version of #71169 coming into view, I think we should close this for now!
I've come back around to some version of this idea being a good one. I'd really like it if we had high confidence in our ability to detect single range outages in CC, even if with a time delay given that all ranges are not probed at every second. And tho I'm excited about #71169, (i) we don't have it yet and (ii) it's sort of orthogonal to the idea of detecting single range outages at high confidence. @tbg suggested sticky probing:
I think I vote we do this but slightly tweaked. If a probe to some range fails, we probe it on every other probe (or similar). The other half of the time, we probe via the planner. Then we get data on the size of the outage (not perfect, but some data) and we also have confidence we will eventually catch single range outages (assuming not many CRDB restarts). I expect this won't add much complexity to the code? Maybe we should just code it up and compare it to this implementation.

One Q about this for Tobias: This means all nodes will eventually send a probe repeatedly to an unavailable range (assuming there is just one unavailable range). As the number of nodes grows, so will the probe rate on the unavailable range. Is this a problem? I guess one advantage of the implementation here is that the "problem range" probe rate is separate from the always-on planner-style probe rate, and so it can be set lower.

Also, CC @Santamaura for KV observability perspective.
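The alternating scheme proposed here (retry the failed range on every other probe, otherwise defer to the planner) could look roughly like this in Go; `stickyPlanner` and its methods are hypothetical names, and `plan` stands in for kvprober's real planner:

```go
package main

import "fmt"

// stickyPlanner alternates between re-probing a remembered failed
// range and delegating to the regular planner, so single-range outages
// are eventually caught while cluster-wide coverage data keeps flowing.
type stickyPlanner struct {
	failed int64 // 0 means no remembered failure
	retry  bool  // when true, the next probe retries the failed range
	plan   func() int64
}

// next returns the range ID to probe this tick.
func (s *stickyPlanner) next() int64 {
	if s.failed != 0 && s.retry {
		s.retry = false
		return s.failed
	}
	s.retry = true
	return s.plan()
}

// report records a probe outcome; a success clears the remembered range.
func (s *stickyPlanner) report(rangeID int64, ok bool) {
	if !ok {
		s.failed = rangeID
		s.retry = true
	} else if rangeID == s.failed {
		s.failed = 0
	}
}

func main() {
	n := int64(100)
	s := &stickyPlanner{plan: func() int64 { n++; return n }}
	s.report(7, false)              // range 7 failed a probe
	fmt.Println(s.next(), s.next()) // 7 101: failed range, then planner
}
```

Joshimhoff's concern still applies unchanged: every node running this logic will converge on the same unavailable range, so the aggregate retry rate grows with cluster size.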
Apologies for the delay in getting back. I have difficulty providing a snappy answer because I'm uncertain if we're taking the right approach (the prober was simple; now it gets more complicated; are we just backtracking on its fundamental design?) and because other commitments that I consider more urgent require a bit of time management on my end. I'll discuss next steps with @mwang1026 and @shralex tomorrow.
No rush, and tell me if there are things I can do to reduce your workload when helping me out! I'm also uncertain if we are taking the right approach. I'm not uncertain that we should be able to detect single range hard-down outages at high confidence. I'm also not really uncertain that probing at a high level is a good technique for this. The problem isn't that a probe won't catch single range outages like the 6sense outage [1]; it is that we aren't sending probes in a way that leads to an alert given the outage type (single range outage).

[1] We did catch this outage. But we caught it with sqlprober, not kvprober. If it weren't a system range, we wouldn't have caught it.
Maybe we want a "quarantine" system. If you fail a probe, you get quarantined (if there's room, otherwise ignore quarantine). Ranges that are in quarantine get probed separately (to avoid slowing down the main prober) at an interval. We track the time of entrance into the quarantine pool and report the duration the oldest member has been there as a metric. Then you can alert on that (instead of on any metric about write failures), without a temporary blip messing up the alerting.

With that setup, we don't have to rely on failing probes to determine whether there is an outage that's long enough to page someone. Instead, we get to look directly at the quantity we care about, namely how long it's been since we know of the outage. It would work like an exact mirror image of the "uptime" metric.
I like that idea, @tbg! I'll try to find some time to give it a shot soon. |
Note: this is a quick sketch to check whether I'm thinking about the problem correctly. The code is probably not great! I'll clean this up and add tests after getting some feedback on the general approach.
The kvprober provides good coverage of issues that affect many ranges,
but has a lower probability of detecting individual bad ranges. To
improve coverage of the latter case, remember failed ranges from the
existing prober and probe them more frequently in a second pair of probe
loops.
Resolves #74407.
Release note: None