
transport: Add values to the grpc.disconnect_error label for grpc.subchannel.disconnections metric (A94)#8973

Merged
mbissa merged 15 commits into grpc:master from mbissa:subchannel-disconnection-unknown-reason on Mar 31, 2026

Conversation

@mbissa mbissa commented Mar 13, 2026

Contributor

This PR implements granular grpc.disconnect_error labels for the grpc.subchannel.disconnections metric, as defined in gRFC A94.

RELEASE NOTES:

  • transport: Add disconnection reason to the grpc.disconnect_error label for grpc.subchannel.disconnections metric as defined in gRFC A94.

@mbissa mbissa added the Type: Feature (New features or improvements in behavior) and Area: Observability (Includes Stats, Tracing, Channelz, Healthz, Binlog, Reflection, Admin, GCP Observability) labels Mar 13, 2026
@mbissa mbissa added this to the 1.81 Release milestone Mar 13, 2026
codecov Bot commented Mar 13, 2026

Codecov Report

❌ Patch coverage is 94.28571% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.06%. Comparing base (12e91dd) to head (e30f27d).
⚠️ Report is 15 commits behind head on master.

Files with missing lines | Patch % | Lines
clientconn.go | 93.33% | 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8973      +/-   ##
==========================================
+ Coverage   83.04%   83.06%   +0.01%     
==========================================
  Files         411      411              
  Lines       32892    32988      +96     
==========================================
+ Hits        27316    27402      +86     
- Misses       4181     4191      +10     
  Partials     1395     1395              
Files with missing lines | Coverage Δ
internal/transport/http2_client.go | 92.37% <100.00%> (-0.67%) ⬇️
internal/transport/transport.go | 89.06% <ø> (-2.09%) ⬇️
clientconn.go | 90.74% <93.33%> (-0.30%) ⬇️

... and 38 files with indirect coverage changes

@mbissa mbissa changed the title transport: Add values for grpc.disconnect_error label for grpc.subchannel.disconnections metric (A94) transport: Add values to the grpc.disconnect_error label for grpc.subchannel.disconnections metric (A94) Mar 13, 2026
@mbissa mbissa force-pushed the subchannel-disconnection-unknown-reason branch from dedb4b2 to 06fb986 Compare March 13, 2026 07:42
@mbissa mbissa force-pushed the subchannel-disconnection-unknown-reason branch from 215dc6c to de97023 Compare March 13, 2026 09:36
mbissa commented Mar 13, 2026

Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request successfully implements gRFC A94 by adding more granular grpc.disconnect_error labels to the grpc.subchannel.disconnections metric. The changes are well-implemented, introducing a disconnectError field in addrConn and a disconnectErrorString helper to map various error conditions to the new labels. The transport layer modifications to propagate the necessary error details are correct. The addition of comprehensive end-to-end tests is a great way to ensure the new labels are correctly reported in different disconnection scenarios. I have one minor suggestion to clean up a duplicated comment.

Comment thread internal/transport/http2_client.go
@mbissa mbissa requested a review from easwars March 13, 2026 10:41
@easwars easwars self-assigned this Mar 16, 2026
Comment thread internal/transport/http2_client.go Outdated
Comment thread clientconn.go Outdated
Comment thread clientconn.go Outdated
@easwars easwars assigned mbissa and unassigned easwars Mar 16, 2026
@mbissa mbissa force-pushed the subchannel-disconnection-unknown-reason branch from 1200625 to 5584fb4 Compare March 17, 2026 06:47
@mbissa mbissa force-pushed the subchannel-disconnection-unknown-reason branch from f219fe9 to 88cd619 Compare March 17, 2026 07:32
mbissa commented Mar 17, 2026

Contributor Author

Addressed the comments. One test flaked once because of the timing of how the connection was closed, so I changed the test to be more deterministic. The master branch also had new tests that now fail, so there are a couple of minor changes for those as well.

@mbissa mbissa assigned easwars and unassigned mbissa Mar 17, 2026
Comment thread clientconn.go Outdated
Comment thread internal/transport/http2_client.go Outdated
Comment thread balancer/pickfirst/metrics_test.go Outdated
Comment thread balancer/pickfirst/metrics_test.go
Comment thread balancer/pickfirst/metrics_test.go Outdated
Comment thread stats/opentelemetry/e2e_test.go Outdated
Comment thread stats/opentelemetry/e2e_test.go Outdated
Comment thread stats/opentelemetry/e2e_test.go Outdated
Comment thread stats/opentelemetry/e2e_test.go Outdated
Comment thread clientconn.go
@easwars easwars assigned mbissa and unassigned easwars Mar 23, 2026
@mbissa mbissa assigned easwars and unassigned mbissa Mar 25, 2026
mbissa commented Mar 25, 2026

Contributor Author

I realize the tests are not very idiomatic - I will structure them into a table and push one more commit.

@mbissa mbissa assigned easwars and unassigned easwars Mar 25, 2026
Comment thread balancer/pickfirst/metrics_test.go
Comment thread balancer/pickfirst/metrics_test.go Outdated
Comment thread balancer/pickfirst/metrics_test.go Outdated
Comment thread balancer/pickfirst/metrics_test.go Outdated

func (s) TestDisconnectLabel(t *testing.T) {
// 1. Valid GOAWAY
// Server GracefulStop sends GOAWAY with active streams = 0.

Contributor

What does active streams have to do with anything that is happening with regards to this test?

@mbissa mbissa Mar 30, 2026

Contributor Author

This is just to explain that we go straight to closing the stream and expect a GOAWAY. It has no specific bearing on the value of the label itself.

Contributor

Can we please have descriptive comments for each of the three subtests?

Also, there are no active streams because runDisconnectLabelTest only performs a unary RPC before invoking the triggerFunc. It would be nice to have some of these things clearly explained in the comments.

My philosophy with tests is that they have to be as readable as possible, so that when someone lands on it (either because they are debugging a failed test or because they are trying to understand how the code being tested works), they should be able to very quickly understand what the test is doing and what it is expecting. The "how" part is usually less important, and as long as the "what"s are clearly documented, the reader will have a much easier time.

Comment thread balancer/pickfirst/metrics_test.go Outdated
return gotMetrics
}

func (s) TestDisconnectLabel(t *testing.T) {

Contributor

There seem to be more disconnect reason labels than what are being tested here. It looks like they are covered in the otel e2e tests? Why do we cover only a subset here?

Contributor Author

The goal here is to just check that the plumbing works, e2e tests verify all scenarios.

Contributor

Can we please add a docstring that mentions this.

Contributor Author

done.

Comment thread stats/opentelemetry/e2e_test.go Outdated
Comment thread stats/opentelemetry/e2e_test.go Outdated
Comment thread stats/opentelemetry/e2e_test.go Outdated
Comment thread internal/transport/http2_client.go
Comment thread clientconn.go
Comment on lines +1590 to +1595
default:
	var sysErr syscall.Errno
	if errors.As(err, &sysErr) {
		return "socket error"
	}
	return "unknown"

Contributor

Nit: If you defined the sysErr variable at the top of the switch, you could add a case for it, instead of folding it into the default case

	switch {
	// Existing cases
	case errors.As(err, &sysErr):
		return "socket error"
	default:
		return "unknown"
	}

Contributor

Not sure if you missed this or decided not to implement it.

Contributor Author

missed pushing that change, done now.

@easwars easwars assigned mbissa and unassigned easwars Mar 26, 2026
@mbissa mbissa assigned easwars and unassigned mbissa Mar 30, 2026

@easwars easwars left a comment

Contributor

LGTM, modulo some minor comments

Comment thread internal/transport/transport.go Outdated
Comment on lines +756 to +758
// If the connection was closed by a GOAWAY frame, this will usually be a
// connection error that describes the connection closing.
Err error

Contributor

I don't completely understand this last sentence. If the transport is being closed because of the receipt of a GOAWAY, I see that we usually don't set this field. Is that not true? Can this comment be made easier to understand? Thanks.

Contributor Author

correct, I have fixed the comment.

Comment thread clientconn.go
Comment on lines +1704 to +1706
if ac.disconnectErrorLabel == "" {
	ac.disconnectErrorLabel = "subchannel shutdown"
}

Contributor

The above comment was supposed to be on top of the line that sets the connectivity state to Shutdown. Can we continue to retain it that way?

Contributor Author

done.
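
The guard quoted in this thread can be read as a fallback: the "subchannel shutdown" label applies only when no more specific reason has been recorded. A minimal sketch of that reading follows. `subConn`, `recordDisconnect`, and `shutdown` are hypothetical names; only the `disconnectErrorLabel` field and the label values appear in the review. Locking is omitted, and the sketch makes no claim about how the real code synchronizes access.

```go
package main

import "fmt"

// subConn is a hypothetical, stripped-down stand-in for the type under
// review. Only the disconnectErrorLabel field name comes from the thread.
type subConn struct {
	disconnectErrorLabel string
}

// recordDisconnect stores a specific reason, for example one reported by the
// transport when the connection is lost.
func (sc *subConn) recordDisconnect(label string) {
	sc.disconnectErrorLabel = label
}

// shutdown applies the fallback label only if nothing was recorded earlier,
// and returns the label that would be reported.
func (sc *subConn) shutdown() string {
	if sc.disconnectErrorLabel == "" {
		sc.disconnectErrorLabel = "subchannel shutdown"
	}
	return sc.disconnectErrorLabel
}

func main() {
	idle := &subConn{}
	fmt.Println(idle.shutdown())

	broken := &subConn{}
	broken.recordDisconnect("socket error")
	fmt.Println(broken.shutdown())
}
```

With this ordering, a reason recorded before shutdown is never overwritten by the generic fallback.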

@easwars easwars assigned mbissa and unassigned easwars Mar 30, 2026
@mbissa mbissa merged commit 34da8d0 into grpc:master Mar 31, 2026
13 of 14 checks passed
2 participants