kvserver: add metric for replica circuit breaker #74908
Merged
craig[bot] merged 2 commits into cockroachdb:master, Jan 20, 2022
Member
Force-pushed from 0a755be to b5b09b5
Force-pushed from 90b51d4 to 0b8e5ea
Force-pushed from 0b8e5ea to f0fcaef
Member
Author
@aliher1911 would you mind reviewing this since @erikgrinaker is out? If there are any nontrivial questions feel free to call me up so we can chat about it.
erikgrinaker
approved these changes
Jan 19, 2022
Contributor
erikgrinaker
left a comment
LGTM, modulo some minor comments.
Member
Author
TFTR! Comments addressed, but looks like I have something to look into: I suspect we're not always emitting the correct number of "Reset" events.
Force-pushed from f0fcaef to d43b241
It was possible for probes to be launched when the breaker was not tripped. This generated `OnReset` events that were not accompanied by preceding `OnTrip(nil, err)` events. We're trying to use these events to keep track of how many breakers are tripped, so it makes sense to have them pair up.

With the new invariant checks added, this failed immediately under stress prior to this commit, in `TestBreakerRealistic`.

Release note: None
Add a metric that tracks how many replicas have tripped circuit breakers.

Add a metric that counts the trip events as well. This can highlight problems where the circuit is tripped in error and immediately untrips (which may not be caught by the first metric).

Since tripped circuit breakers highlight an availability problem, we're also adding an alert/aggregation rule.

Also, as requested by @mwang1026, report the count of trip events via telemetry as well.

Fixes cockroachdb#74505.

Release note: None
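The difference between the two metrics can be sketched as a gauge plus a counter. The `breakerMetrics` type and its method names below are hypothetical and do not reflect the actual kvserver metric names or the metric library used in the PR; the sketch assumes `onTrip` is called only on a transition from untripped to tripped.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// breakerMetrics is a hypothetical stand-in for the two metrics in this
// commit: a gauge of currently tripped breakers and a cumulative counter
// of trip events.
type breakerMetrics struct {
	curTripped int64 // gauge: goes up on trip, down on reset
	cumTripped int64 // counter: only ever goes up
}

// onTrip records a transition from untripped to tripped (assumption).
func (m *breakerMetrics) onTrip() {
	atomic.AddInt64(&m.curTripped, 1)
	atomic.AddInt64(&m.cumTripped, 1)
}

// onReset records a transition from tripped back to untripped.
func (m *breakerMetrics) onReset() {
	atomic.AddInt64(&m.curTripped, -1)
}

func (m *breakerMetrics) current() int64 { return atomic.LoadInt64(&m.curTripped) }
func (m *breakerMetrics) total() int64   { return atomic.LoadInt64(&m.cumTripped) }

func main() {
	m := &breakerMetrics{}
	// A breaker that trips in error and immediately untrips.
	m.onTrip()
	m.onReset()
	// The gauge is back at zero, but the counter still records the event.
	fmt.Println("currently tripped:", m.current())
	fmt.Println("trip events:", m.total())
}
```

Because the gauge is only correct when trips and resets pair up, a stray reset would drive it negative, which is why the preceding commit enforces the pairing.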
Force-pushed from d43b241 to 442e5b9
Member
Author
Got the sucker, prepended a commit.
Member
Author
bors r=erikgrinaker
Contributor
Build succeeded
Add a metric that tracks how many replicas have tripped circuit
breakers.
Add a metric that counts the trip events as well. This can highlight
problems where the circuit is tripped in error and immediately untrips
(which may not be caught by the first metric).
Also, as requested by @mwang1026, report the latter via telemetry as
well.
Fixes #74505.
Release note: None