transport: Add values to the grpc.disconnect_error label for grpc.subchannel.disconnections metric (A94)#8973
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #8973 +/- ##
==========================================
+ Coverage 83.04% 83.06% +0.01%
==========================================
Files 411 411
Lines 32892 32988 +96
==========================================
+ Hits 27316 27402 +86
- Misses 4181 4191 +10
Partials 1395 1395
🚀 New features to boost your workflow:
|
dedb4b2 to
06fb986
Compare
215dc6c to
de97023
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request successfully implements gRFC A94 by adding more granular grpc.disconnect_error labels to the grpc.subchannel.disconnections metric. The changes are well-implemented, introducing a disconnectError field in addrConn and a disconnectErrorString helper to map various error conditions to the new labels. The transport layer modifications to propagate the necessary error details are correct. The addition of comprehensive end-to-end tests is a great way to ensure the new labels are correctly reported in different disconnection scenarios. I have one minor suggestion to clean up a duplicated comment.
1200625 to
5584fb4
Compare
f219fe9 to
88cd619
Compare
|
fixed the comments, one test flaked once due to timing of how the connection was closed, so changed the test to be more deterministic. Master branch had new tests which were failing now, so couple of minor changes for that as well. |
|
I realize the tests are not very idiomatic - I will structure them into a table and push one more commit. |
|
|
||
| func (s) TestDisconnectLabel(t *testing.T) { | ||
| // 1. Valid GOAWAY | ||
| // Server GracefulStop sends GOAWAY with active streams = 0. |
There was a problem hiding this comment.
What does active streams have to do with anything that is happening with regards to this test?
There was a problem hiding this comment.
This is just to explain that we straightaway go to close the stream and expect GOAWAY. It does not have specific bearing on the value of the label itself.
There was a problem hiding this comment.
Can we please have descriptive comments for each of the three subtests.
And there are no active streams because the runDisconnectLabelTest only performs a unary RPC before invoking the triggerFunc. Yeah, some of things would be nice to be clearly explained in the comments.
My philosophy with tests is that they have to be as readable as possible, so that when someone lands on it (either because they are debugging a failed test or because they are trying to understand how the code being tested works), they should be able to very quickly understand what the test is doing and what it is expecting. The "how" part is usually less important, and as long as the "what"s are clearly documented, the reader will have a much easier time.
| return gotMetrics | ||
| } | ||
|
|
||
| func (s) TestDisconnectLabel(t *testing.T) { |
There was a problem hiding this comment.
There seem to be more disconnect reason labels than what are being tested here. Are looks like they are covered in otel e2e tests? Why do we cover only a subset here?
There was a problem hiding this comment.
The goal here is to just check that the plumbing works, e2e tests verify all scenarios.
There was a problem hiding this comment.
Can we please add a docstring that mentions this.
| default: | ||
| var sysErr syscall.Errno | ||
| if errors.As(err, &sysErr) { | ||
| return "socket error" | ||
| } | ||
| return "unknown" |
There was a problem hiding this comment.
Nit: If you defined the sysErr variable at the top of the switch, you could add a case for it, instead of folding it into the default case
switch {
// Existing cases
case errors.As(err, &sysErr):
return "socket error"
default:
return "unknown"
}There was a problem hiding this comment.
Not sure if you missed this or decided not to implement it.
There was a problem hiding this comment.
missed pushing that change, done now.
easwars
left a comment
There was a problem hiding this comment.
LGTM, modulo some minor comments
| return gotMetrics | ||
| } | ||
|
|
||
| func (s) TestDisconnectLabel(t *testing.T) { |
There was a problem hiding this comment.
Can we please add a docstring that mentions this.
|
|
||
| func (s) TestDisconnectLabel(t *testing.T) { | ||
| // 1. Valid GOAWAY | ||
| // Server GracefulStop sends GOAWAY with active streams = 0. |
There was a problem hiding this comment.
Can we please have descriptive comments for each of the three subtests.
And there are no active streams because the runDisconnectLabelTest only performs a unary RPC before invoking the triggerFunc. Yeah, some of things would be nice to be clearly explained in the comments.
My philosophy with tests is that they have to be as readable as possible, so that when someone lands on it (either because they are debugging a failed test or because they are trying to understand how the code being tested works), they should be able to very quickly understand what the test is doing and what it is expecting. The "how" part is usually less important, and as long as the "what"s are clearly documented, the reader will have a much easier time.
| default: | ||
| var sysErr syscall.Errno | ||
| if errors.As(err, &sysErr) { | ||
| return "socket error" | ||
| } | ||
| return "unknown" |
There was a problem hiding this comment.
Not sure if you missed this or decided not to implement it.
| // If the connection was closed by a GOAWAY frame, this will usually be a | ||
| // connection error that describes the connection closing. | ||
| Err error |
There was a problem hiding this comment.
I don't completely understand this last sentence. If the transport is being closed becasue of the receipt of a GOAWAY, I see that we usually don't set this field. Is that not true? Can this comment be made more easily understandable. Thanks.
There was a problem hiding this comment.
correct, I have fixed the comment.
| if ac.disconnectErrorLabel == "" { | ||
| ac.disconnectErrorLabel = "subchannel shutdown" | ||
| } |
There was a problem hiding this comment.
The above comment was supposed to be on top the line that sets the conenctivity state to Shutdown. Can we continue to retain it that way.
This PR implements granular grpc.disconnect_error labels for the grpc.subchannel.disconnections metric, as defined in gRFC A94.
RELEASE NOTES: