Is your feature request related to a problem? Please describe.
When an outage occurs, one of the most interesting questions (as far as the KV team is concerned) is whether any ranges in the system are unable to serve reads and/or writes. We further want to distinguish between ranges that are actively failing to do so and ranges that likely would (but are not currently being asked to do so). There is further a distinction between replicas and ranges - a range may have replicas that work just fine, but specific replicas could be unable to participate in, say, follower reads, or are unable to forward to the leader successfully.
We want to be able to write robust metrics-based alerting rules for unavailable ranges. When these signal, we also want to be able to discover easily which ranges/replicas are currently in trouble, and which kind of trouble they're in to determine next steps.
Currently none of this is trivial to determine. To determine whether a range is unable to make progress, one can look for messages like these (which indicate nonzero values for the correspondingly named metrics):
```
[‹n1,summaries›] ‹health alerts detected: {Alerts:[{StoreID:1 Category:METRICS Description:requests.slow.latch Value:82}]}›
[n6,summaries] health alerts detected: ‹{Alerts:[{StoreID:6 Category:METRICS Description:requests.slow.raft Value:1} {StoreID:6 Category:METRICS Description:requests.slow.lease Value:127}]}›
```
(NB: leases have a one-minute timeout, so often they do not trigger this gauge in the first place, as the gauge is essentially "lease proposals that have been pending for longer than a minute").
or
```
[n4,summaries] health alerts detected: ‹{Alerts:[{StoreID:4 Category:METRICS Description:requests.slow.raft Value:23}]}›
```
(NB: this metric is often not triggered when the range loses the lease, since then the only ongoing proposal is a lease proposal, which as mentioned above has the 1min timeout)
or the `ranges.unavailable` metric (which, however, only counts replicas for which it believes a quorum of nodes to be offline, which is only one possible reason for a range to be unavailable; additionally, this metric does not track ranges for which all nodes are offline!)
Conceivably, one could add an alerting rule on the sum of these metrics (they are all gauges).
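As a sketch, such a rule might look like the following. This is a hypothetical Prometheus alerting rule, not something that ships today; the metric names assume the common convention of rewriting dotted metric names with underscores for the Prometheus endpoint, and the alert name and grouping are made up for illustration.

```yaml
groups:
  - name: kv_unavailability
    rules:
      - alert: RangesPossiblyUnavailable
        # Sum of the gauges discussed above; any nonzero value signals trouble.
        expr: >
          sum by (instance)
            (requests_slow_latch + requests_slow_raft
             + requests_slow_lease + ranges_unavailable) > 0
        for: 1m
```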
This would leave open the problem of determining which ranges triggered the metric. In practice, for that we refer to the logs; concretely, look for messages containing `have been waiting`. By convention, these are emitted whenever one of the metrics feeding into a "health alerts detected" message is triggered, and they identify the source replica via the log tags.
Additionally, we are currently adding a black-box KV probing mechanism which comes with additional metrics that indicate range issues. Its introduction was driven by frustration with the status quo and in particular by the desire to have a clear source for alerting rules.
We arrive at the following picture:
| metric | populated via | misses | replica discovery |
|---|---|---|---|
| the above subset of `requests.slow.*` | hanging requests | replicas without traffic | from log messages |
| (prober metrics) | on failed probes | ranges it hasn't probed recently | from log messages |
| `ranges.unavailable` | periodically by each node based on its replicas, but misses livelocks | ranges that have lost all replicas | probably the `system.reports_*` tables or the problem ranges report; problems here typically affect large numbers of ranges |
Describe the solution you'd like
TBD; for now, this issue just documents the status quo and how it is lacking. One could argue that the mechanisms above complement each other and that all that's needed is some cleanup. One could also argue that this is a hodge-podge that makes alerting and quick debugging difficult.
At the very least, consistent structured logging should be emitted whenever any of these alerts fires.
Likely, all of these gauges should be converted to counters that fire more aggressively, as I believe this to be better for alerting purposes (cc @joshimhoff). For example, the current lease timeout of 1min and the current gauge bump threshold of 1min for `requests.slow.*` likely give us false positives. An aggressive counter (bumped at 10s) would avoid that.
Describe alternatives you've considered
TBD
Additional context
#33007 is related. It discusses circuit-breaking writes to ranges we know are unavailable.
Jira issue: CRDB-3066