Skip to content

kvserver: add metric for replica circuit breaker#74908

Merged
craig[bot] merged 2 commits intocockroachdb:masterfrom
tbg:circ-breaker-metric
Jan 20, 2022
Merged

kvserver: add metric for replica circuit breaker#74908
craig[bot] merged 2 commits intocockroachdb:masterfrom
tbg:circ-breaker-metric

Conversation

@tbg
Copy link
Copy Markdown
Member

@tbg tbg commented Jan 17, 2022

Add a metric that tracks how many replicas have tripped circuit
breakers.
Add a metric that counts the trip events as well. This can highlight
problems where the circuit is tripped in error and immediately untrips
(which may not be caught by the first metric).

Also, as requested by @mwang1026, report the latter via telemetry as
well.

Fixes #74505.

Release note: None

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@tbg tbg force-pushed the circ-breaker-metric branch 2 times, most recently from 0a755be to b5b09b5 Compare January 18, 2022 15:27
@tbg tbg force-pushed the circ-breaker-metric branch 2 times, most recently from 90b51d4 to 0b8e5ea Compare January 19, 2022 08:03
@tbg tbg marked this pull request as ready for review January 19, 2022 08:03
@tbg tbg requested a review from a team as a code owner January 19, 2022 08:03
@tbg tbg force-pushed the circ-breaker-metric branch from 0b8e5ea to f0fcaef Compare January 19, 2022 08:04
@tbg tbg requested a review from aliher1911 January 19, 2022 08:05
@tbg
Copy link
Copy Markdown
Member Author

tbg commented Jan 19, 2022

@aliher1911 would you mind reviewing this since @erikgrinaker is out? If there are any nontrivial questions feel free to call me up so we can chat about it.

Copy link
Copy Markdown
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, modulo some minor comments.

@tbg
Copy link
Copy Markdown
Member Author

tbg commented Jan 19, 2022

TFTR! Comments addressed, but looks like I have something to look into:

Error Trace:	client_replica_circuit_breaker_test.go:121
        	Error:      	Not equal: 
        	            	expected: int(0)
        	            	actual  : int64(-1)
        	Test:       	TestReplicaCircuitBreaker_LeaseholderTripped

I suspect we're not always emitting the correct number of "Reset" events.

@tbg tbg force-pushed the circ-breaker-metric branch from f0fcaef to d43b241 Compare January 19, 2022 12:09
tbg added 2 commits January 19, 2022 17:34
It was possible for probes to be launched when the breaker was not
tripped. This generated `OnReset` events that were not accompanied
by preceding `OnTrip(nil, err)` events. We're trying to use these
events to keep track of how many breakers are tripped, so it makes
sense to have them pair up.

With the new invariant checks added, this failed immediately under
stress prior to this commit, in `TestBreakerRealistic`.

Release note: None
Add a metric that tracks how many replicas have tripped circuit
breakers.
Add a metric that counts the trip events as well. This can highlight
problems where the circuit is tripped in error and immediately untrips
(which may not be caught by the first metric).

Since tripped circuit breakers highlight an availability problem,
we're also adding an alert/aggregation rule.

Also, as requested by @mwang1026, report the count of trip events via
telemetry as well.

Fixes cockroachdb#74505.

Release note: None
@tbg tbg force-pushed the circ-breaker-metric branch from d43b241 to 442e5b9 Compare January 19, 2022 16:36
@tbg
Copy link
Copy Markdown
Member Author

tbg commented Jan 19, 2022

Got the sucker, prepended a commit.

Copy link
Copy Markdown
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks.

@tbg
Copy link
Copy Markdown
Member Author

tbg commented Jan 20, 2022

bors r=erikgrinaker

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jan 20, 2022

Build succeeded:

@craig craig bot merged commit e8a0b75 into cockroachdb:master Jan 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

kvserver: add metrics for per-Replica circuit breakers

3 participants