Skip to content

client: Change connectivity state to CONNECTING when creating the name resolver#8710

Merged
easwars merged 24 commits into
grpc:masterfrom
easwars:move_to_connecting_when_exiting_idle
Dec 9, 2025
Merged

client: Change connectivity state to CONNECTING when creating the name resolver#8710
easwars merged 24 commits into
grpc:masterfrom
easwars:move_to_connecting_when_exiting_idle

Conversation

@easwars

@easwars easwars commented Nov 14, 2025

Copy link
Copy Markdown
Contributor

Fixes #7686

Current Behavior

  • When client exits IDLE and creates the name resolver, it stays in IDLE until the connectivity state is set by the LB policy.
  • When exiting IDLE mode (because of Connect being called or because of an RPC), if name resolver creation fails, we stay in IDLE.

New Behavior

  • When the client exits IDLE and creates the name resolver, it moves to CONNECTING. Moving forward, the connectivity state will be set by the LB policy.
  • When exiting IDLE mode (because of Connect being called or because of an RPC), we have already moved to CONNECTING (because of the previous bullet point). If name resolver creation fails, we will move to TRANSIENT_FAILURE and start the idle timer and move back to IDLE when the timer fires

Implementation details:

  • The client channel now treats resolver build errors encountered during exiting IDLE identically to resolver errors received prior to valid updates.
  • Build uses a new unsafe API on the idleness manager to mark the channel as exited IDLE.
  • The idleness Manager invokes the channel's ExitIdleMode (which now does not return an error) and updates internal state to reflect that it is no longer in IDLE.
  • OnFinish call options are now invoked even if stream creation fails during an RPC. This fulfills the guarantee for these options and ensures the idleness Manager’s activeCallsCount remains accurate.

RELEASE NOTES:

  • client: Change connectivity state to CONNECTING when creating the name resolver (as part of exiting IDLE).
  • client: Change connectivity state to TRANSIENT_FAILURE if name resolver creation fails (as part of exiting IDLE).
  • client: Change connectivity state to IDLE after idle timeout expires even when current state is TRANSIENT_FAILURE.
  • client: Fix a bug that resulted in OnFinish call option not being invoked for RPCs where stream creation failed.

@easwars easwars requested a review from dfawley November 14, 2025 21:48
@easwars easwars added Type: Bug Area: Client Includes Channel/Subchannel/Streams, Connectivity States, RPC Retries, Dial/Call Options and more. labels Nov 14, 2025
@easwars easwars added this to the 1.78 Release milestone Nov 14, 2025
@easwars easwars requested a review from arjan-bal November 14, 2025 21:48
@codecov

codecov Bot commented Nov 14, 2025

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 92.45283% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.34%. Comparing base (112ec12) to head (fe31e2a).
⚠️ Report is 17 commits behind head on master.

Files with missing lines Patch % Lines
clientconn.go 87.50% 1 Missing and 1 partial ⚠️
internal/idle/idle.go 95.00% 0 Missing and 1 partial ⚠️
resolver_wrapper.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8710      +/-   ##
==========================================
+ Coverage   83.28%   83.34%   +0.05%     
==========================================
  Files         416      418       +2     
  Lines       32267    32385     +118     
==========================================
+ Hits        26874    26990     +116     
- Misses       4019     4024       +5     
+ Partials     1374     1371       -3     
Files with missing lines Coverage Δ
stream.go 82.11% <100.00%> (+0.27%) ⬆️
internal/idle/idle.go 89.28% <95.00%> (+0.12%) ⬆️
resolver_wrapper.go 91.58% <0.00%> (-0.87%) ⬇️
clientconn.go 90.30% <87.50%> (+0.17%) ⬆️

... and 48 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread dial_test.go Outdated

func (s stringerVal) String() string { return s.s }

const errResolverBuildercheme = "test-resolver-build-failure"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment thread resolver_wrapper.go Outdated
Comment on lines +84 to +92
// https://github.com/grpc/grpc/blob/master/doc/connectivity-semantics-and-api.md
// defines CONNECTING as follows:
// - The channel is trying to establish a connection and is waiting to
// make progress on one of the steps involved in name resolution, TCP
// connection establishment or TLS handshake. This may be used as the
// initial state for channels upon creation.
//
// We are starting the name resolver here as part of exiting IDLE, so
// transitioning to CONNECTING is the right thing to do.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO comments should be short and to the point.

Short comments make the code take up less space, which makes it easier to read and understand. Long comments make long functions extremely long and not fit on the page.

Honestly, I think a comment for this action isn't even necessary. But if you think we need one, this could be:

// Set state to CONNECTING before building the name resolver
// so the channel does not remain in IDLE.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines +740 to +742
if state := cc.GetState(); state != connectivity.Idle {
t.Fatalf("Expected initial state to be IDLE, got %v", state)
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AwaitState above already tested this IIUC

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// Ensure that the client is in IDLE before connecting.
ctx, cancel := context.WithTimeout(context.Background(), defaultTestTimeout)
defer cancel()
testutils.AwaitState(ctx, t, cc, connectivity.Idle)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't need an Await right? It should just check the current state, and never wait for changes, as we know it starts idle.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true. Moved the check for the current state to here, and got rid of the Await.

Comment thread test/clientconn_state_transition_test.go Outdated
Comment thread resolver_wrapper.go Outdated
Comment thread resolver_balancer_ext_test.go Outdated
// Tests the case where the resolver reports an error to the channel before
// reporting an update. Verifies that the channel eventually moves to
// TransientFailure and a subsequent RPC returns the error reported by the
// TransientFailure and a subsequent RPCs returns the error reported by the

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/ a / /

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment thread test/clientconn_state_transition_test.go Outdated
Comment thread resolver_wrapper.go Outdated
Authority: ccr.cc.authority,
MetricsRecorder: ccr.cc.metricsRecorderList,
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert this diff & file.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment thread clientconn.go Outdated
Comment thread clientconn.go Outdated
@dfawley dfawley assigned easwars and unassigned dfawley Dec 1, 2025
@easwars

easwars commented Dec 2, 2025

Copy link
Copy Markdown
Contributor Author

@dfawley
I have the changes ready for the above two comments, but I want to hear from you before sending a commit for it. Thanks.

@easwars easwars assigned dfawley and unassigned easwars Dec 2, 2025
Comment thread resolver_balancer_ext_test.go
dopts := []grpc.DialOption{
grpc.WithTransportCredentials(insecure.NewCredentials()),
grpc.WithResolvers(&testResolverBuilder{logger: t, manualR: mr}),
grpc.WithIdleTimeout(time.Second),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid assuming that a 1s interval is sufficient for 2 attempts to fail?

One solution might be to split this into two tests:

  1. Test that setting a sufficiently large idle timeout keeps the channel in TF and doesn't trigger the resolver builder again.
  2. Set a defaultTestShortIdleTimeout and ensure the channel exits TF, calling the resolver builder again.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had a way to enter IDLE mode from tests. But looks like that goes through the regular idleness logic. So, I added another one to forcefully put the channel in IDLE (for testing purposes), and changed the test to forcefully enter IDLE after it enters TF.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@easwars the way the test can flake is if there's a gap of more than 1s b/w the two RPC attempts. The sequence of events is:

  1. RPC 1 is made, results in testResolverBuilder.Build being called and failing. RPC fails with code UNAVAILABLE.
  2. 1s passes, the channel enters IDLE mode.
  3. RPC 2 is made, results in testResolverBuilder.Build being called. Since the builder only returns an error for the first call, the Build call succeeds this time. The RPC doesn't fail with the expected error.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is a valid sequence of events. But there is nothing else happening in this test at all when that for loop making the two RPCs is running. And the RPCs don't even get far enough to create an LB policy, create subchannels, create transports, none of that good stuff is happening. While I agree it could possible take a second after the first RPC runs before the second one gets to run, I feel the probability is miniscule.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree 1s is sufficient, but it introduces a fixed latency—the test will always take the full second even if it could finish in milliseconds. It's not a blocker for this PR, but something to keep in mind.

Comment thread stream.go Outdated
Comment thread clientconn.go Outdated

@dfawley dfawley left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great now, I think this is what we want. I think we should take about naming a bit (which is just textual changes), and I'm concerned about the ForceIdle thing. Otherwise LGTM!

Comment thread internal/idle/idle.go Outdated
Comment on lines +256 to +259
// MarkAsExitedIdleMode instructs the Manager to update its internal state to
// indicate that the channel has exited IDLE mode. This is only used by the gRPC
// client when it exits IDLE mode manually from Dial.
func (m *Manager) MarkAsExitedIdleMode() {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably needs warnings on it. Generally you can't ever use it in a situation where you can possibly race with anything else calling ExitIdleMode or OnCallBegin, and you must know you are really idle.

Bikeshedding:

UnsafeSetNotIdle()? UnsafeSetActive()? InitializeAsActive() (to convey this is only to be used as part of creation)?

Aside but very related: Do we already say "active" is the opposite of "idle" anywhere? Or are the only states "in idle mode" and "not in idle mode". The latter feels a little awkward. Maybe we can rename things to use "active" vs "idle":

ExitIdleMode() -> Activate()
EnterIdleModeForTesting() -> GoIdle()  // Do we really need to ForTesting this? It seems safe to use.

Further bikeshedding that's unrelated: I never found the "Enforcer" name to be intuitive either. I've always thought of the manager as the thing doing the enforcing. The object it's controlling is just enacting the idle manager's commands. How about Actor or Agent or Worker or Channel instead?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably needs warnings on it

Improved the docstring to make it sound more scary.

Bikeshedding

Went with UnsafeSetNotIdle

Do we already say "active" is the opposite of "idle" anywhere?

Yeah, we have never used the term "active" to mean the opposite of "idle" anywhere so far.

Maybe we can rename things to use "active" vs "idle"

We probably could do that. Can we do it in a follow up though?

EnterIdleModeForTesting() -> GoIdle() // Do we really need to ForTesting this? It seems safe to use.

Yes, we need this to test the race between the channel going idle after idle_timeout firing and an RPC trying to keep it from not going idle.

I never found the "Enforcer" name to be intuitive either.

I liked Channel since it conveyed the meaning directly, but eventually decided to go with ClientConn to keep it consistent with other places where an interface's functionality is provided by the client channel.

Comment thread internal/idle/idle.go Outdated

// ForceEnterIdleModeForTesting bypasses the usual checks that happen before
// entering idle mode, and forcefully enter idle mode for testing purposes.
func (m *Manager) ForceEnterIdleModeForTesting() {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this? It feels pretty dangerous to even have it. What are we doing that requires us to enter idle mode even with pending calls?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this so that we can force entering IDLE from tests after making RPCs (and completing those RPCs), but not having to set a small idle_timeout and be vulnerable to timing flakes.

Essentially instead of setting an idle_timeout of say 1s, making an RPC, and waiting for a second (or a little more) for the channel to enter idle, the test could make an RPC, and once the RPC is complete, it can force the channel to enter idle becuase it knows there are no pending calls.

The existing EnterIdleModeForTesting which I've renamed to TryEnterIdleModeForTesting works within the parameters of idleness and moves the channel to idle mode only if there has been no activity on the channel. We could have named it as FireIdleTimeoutForTesting because that is essentially what is happening as part of this method and is required for testing races between when the channel is trying to enter idle (as part of the idle timeout firing) and exit idle (as part an RPC).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coming back to this thread though: #8710 (comment)

Can we avoid assuming that a 1s interval is sufficient for 2 attempts to fail?

Actually the code wasn't assuming that a 1s interval was sufficient for 2 attempts to fail. The 1s interval just determined when the idle timeout would fire and everytime it fired, it would check if the channel had no activity since the last time it fired. I'm going to go back to the previous approach for the following reasons:

  • 1s should be plenty for these two attempts to fail because nothing happens in these two attempts, except the attempt to build the resolver which would fail, which would result in the RPC failing
  • Even if the two attempts took more than 1s, it wouldn't affect the correctness of the test as the idle timeout would fire and the idleness manager would see that there was activity on the channel and therefore would do nothing. And the next time it fires (or the next time), it would eventually find no activity on the channel and would move it to idle

This would allow me to get rid of this ForceEnterIdleModeForTesting.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand. Why would we want to simulate the idle timeout firing? Anything operating at that low of a level should be internal to the idle package. Everything else should have a safe way to say "go idle now -- unless there are pending RPCs by the time this call gets in and the idle lock is taken".

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you concerned more about the name or the functionality of that API itself? This is how it was before the change also. After reading your last comment, I agree that the new name FireIdleTimeoutForTesting seems very low level and can be changed back to EnterIdleModeForTesting. The problem though is that it is possible that it doesn't enter idle mode, because it uses the same logic that non-test code would use to decide if it should try entering idle mode or not. The problem with using separate logic in tests to enter idle mode is that we might not be testing the right set of things for example when we want to test the race between idle timeout firing (and trying to put the channel in idle) and an RPC happening at the same time (that is trying to keep the channel from entering idle).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment in the initial thread about how the 1s interval can cause the test to flake: #8710 (comment)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might not be testing the right set of things for example when we want to test the race between idle timeout firing (and trying to put the channel in idle) and an RPC happening at the same time (that is trying to keep the channel from entering idle).

We shouldn't need these kinds of tests in the channel, though, should we? From the channel's perspective, it needs to properly deal with the race between RPC start and a call to EnterIdle() from the idle manager. It doesn't know or care about the mechanics of the timer firing. Testing the idle manager might require such unit-style tests, in which case "simulating the timer firing" can be a non-exported function. Maybe I'm missing something though.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't need these kinds of tests in the channel, though, should we?

Looks like we have a unit style test for exactly the same thing and it does exactly what you are saying: one goroutine trying to fire the idle timeout and another one trying to call OnCallBeing to start an RPC

So, I ended up removing the test that triggers the same from the channel.

@easwars easwars assigned arjan-bal and unassigned easwars Dec 4, 2025

@arjan-bal arjan-bal left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, there's still a discussion about removing ForceEnterIdleModeForTesting. I'm fine with either outcome.

@arjan-bal arjan-bal removed their assignment Dec 5, 2025
Comment thread internal/idle/idle.go Outdated
m.tryEnterIdleMode()
// FireIdleTimeoutForTesting forcefully triggers the idle timeout to fire.
func (m *Manager) FireIdleTimeoutForTesting() {
m.handleIdleTimeout()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From your other comment:

Are you concerned more about the name or the functionality of that API itself? This is how it was before the change also.

But...this is a behavior change, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've brought back EnterIdleModeForTesting now, but the implementation takes inspiration from how to exit idle from Dial. So the implementation now calls the unexported enterIdleMode on the clientconn and calls a method on the idleness manager to let it know that we have entered idle (so go update your internal state).

This also means that I don't have to use the 1s idle_timeout from one of those tests, and also not directly cause the idle timeout to fire from the test, but instead use a higher level API to ask the clientconn to enter idle mode.

@easwars

easwars commented Dec 5, 2025

Copy link
Copy Markdown
Contributor Author

@dfawley
I'm hoping the PR now should have addressed all outstanding concerns. Please take another look.

Comment thread internal/idle/idle.go Outdated
//
// N.B. This method is intended only for testing purposes. The caller must
// ensure that there are no ongoing RPCs
func (m *Manager) UnsafeSetIdleForTesting() error {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you prefer this over the old version? This is dangerous while the old one was safe. And the old one will work in all the same scenarios that the new one will work in (when you know for sure the channel is not able to perform RPCs), but not vice-versa.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I prefer this one over the previous version is because the idleness manager not only tracks the ongoing RPCs, but also when the last one completed. The latter means that when tests have performed an RPC before wanting to push the channel into idle mode, they need to wait for idle_timeout to fire. Or set the idle_timeout to a really low value and be susceptible (sometimes with very low probabilities) to timing flakes. All previous usages of the old removed EnterIdleModeForTesting were from tests that never made an RPC. But now we have tests that do.

Comment thread clientconn.go Outdated
Comment thread resolver_wrapper.go
Comment thread stream.go Outdated
Comment on lines +188 to +198
if channelz.IsOn() {
cc.incrCallsFailed()
}
// Invoke all the registered OnFinish call options explicitly. A
// non-nil error means that the stream wasn't created, and
// therefore these will be NOT be invoked as part of `cs.finish()`.
for _, o := range opts {
if o, ok := o.(OnFinishCallOption); ok {
o.OnFinish(err)
}
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's see if we can find a better way to unify these things?

Maybe we can have a shared endOfRPC function that is called by finish and from here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment thread internal/idle/idle.go Outdated
Comment on lines +171 to +172
if checkActivity {
if atomic.LoadInt32(&m.activeSinceLastTimerCheck) == 1 {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These can be merged to remove an indent.

Comment thread stream.go Outdated
var emptyMethodConfig = serviceconfig.MethodConfig{}

// endOfClientStream performs cleanup actions required for both successful and
// failed streams. This includes incrementing channelz stats and invoke all

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// failed streams. This includes incrementing channelz stats and invoke all
// failed streams. This includes incrementing channelz stats and invoking all

@easwars easwars merged commit e413838 into grpc:master Dec 9, 2025
15 checks passed
@easwars easwars deleted the move_to_connecting_when_exiting_idle branch December 9, 2025 00:03
agocs added a commit to pomerium/pomerium that referenced this pull request Dec 30, 2025
## Summary

#6011 wants to bump
`google.golang.org/grpc` from v1.77.0 to v1.78.0. [1] That is causing
TestLoadSessionHandle in github.com/pomerium/pomerium/proxy to time out
after 10 minutes. The tests create new proxies and call ServeHTTP on
test requests. That execution gets all the way to [GetDataBrokerRecord
in
querier.go](https://github.com/pomerium/pomerium/blob/ba0fcffe81f27f33dd5aa858dba3a1276208ddcd/pkg/storage/querier.go#L70-L95),
where it stalls out on `res, err := q.Query(ctx, req,
grpc.WaitForReady(true))`.

What I think is happening is 
`grpc.WaitForReady(true)` makes `q.Query` wait for the GRPC client to
connect to the server. [grpc-go
v1.78.0](https://github.com/grpc/grpc-go/releases/tag/v1.78.0) fixes a
[bug](grpc/grpc-go#8710) where the client would
stay in the IDLE state while trying to connect to the GRPC server. What
happens now is the client moves to the TRANSIENT_FAILURE state. The
WaitForReady option sees TRANSIENT_FAILURE and waits until that failure
fixes itself.

By setting a timeout in the context, we pass that context to `q.Query`
eventually, the timeout fires, and GetDataBrokerRecord returns, which is
what the test expected all along.

[1]: I bisected the go.mod file in that branch to figure out which
dependency was causing the test to time out.

## Related issues

#6011
grpc/grpc-go#8710

## User Explanation

This fixes a bug that was causing a test to fail. The test was relying
on a bug in the grpc-go client, which was fixed.

## Checklist

- [x] reference any related issues
- [x] updated unit tests
- [x] add appropriate label (`enhancement`, `bug`, `breaking`,
`dependencies`, `ci`)
- [x] ready for review
nschloe pushed a commit to live-clones/forgejo that referenced this pull request May 28, 2026
…jo) (#12794)

This PR contains the following updates:

| Package | Change | [Age](https://docs.renovatebot.com/merge-confidence/) | [Confidence](https://docs.renovatebot.com/merge-confidence/) |
|---|---|---|---|
| [google.golang.org/grpc](https://github.com/grpc/grpc-go) | `v1.75.0` → `v1.79.3` | ![age](https://developer.mend.io/api/mc/badges/age/go/google.golang.org%2fgrpc/v1.79.3?slim=true) | ![confidence](https://developer.mend.io/api/mc/badges/confidence/go/google.golang.org%2fgrpc/v1.75.0/v1.79.3?slim=true) |

---

### gRPC-Go has an authorization bypass via missing leading slash in :path
[CVE-2026-33186](https://nvd.nist.gov/vuln/detail/CVE-2026-33186) / [GHSA-p77j-4mvh-x3m3](GHSA-p77j-4mvh-x3m3) / [GO-2026-4762](https://pkg.go.dev/vuln/GO-2026-4762)

<details>
<summary>More information</summary>

#### Details
##### Impact
_What kind of vulnerability is it? Who is impacted?_

It is an **Authorization Bypass** resulting from **Improper Input Validation** of the HTTP/2 `:path` pseudo-header.

The gRPC-Go server was too lenient in its routing logic, accepting requests where the `:path` omitted the mandatory leading slash (e.g., `Service/Method` instead of `/Service/Method`). While the server successfully routed these requests to the correct handler, authorization interceptors (including the official `grpc/authz` package) evaluated the raw, non-canonical path string. Consequently, "deny" rules defined using canonical paths (starting with `/`) failed to match the incoming request, allowing it to bypass the policy if a fallback "allow" rule was present.

**Who is impacted?**
This affects gRPC-Go servers that meet both of the following criteria:
1. They use path-based authorization interceptors, such as the official RBAC implementation in `google.golang.org/grpc/authz` or custom interceptors relying on `info.FullMethod` or `grpc.Method(ctx)`.
2. Their security policy contains specific "deny" rules for canonical paths but allows other requests by default (a fallback "allow" rule).

The vulnerability is exploitable by an attacker who can send raw HTTP/2 frames with malformed `:path` headers directly to the gRPC server.

##### Patches
_Has the problem been patched? What versions should users upgrade to?_

Yes, the issue has been patched. The fix ensures that any request with a `:path` that does not start with a leading slash is immediately rejected with a `codes.Unimplemented` error, preventing it from reaching authorization interceptors or handlers with a non-canonical path string.

Users should upgrade to the following versions (or newer):
* **v1.79.3**
* The latest **master** branch.

It is recommended that all users employing path-based authorization (especially `grpc/authz`) upgrade as soon as the patch is available in a tagged release.

##### Workarounds
_Is there a way for users to fix or remediate the vulnerability without upgrading?_

While upgrading is the most secure and recommended path, users can mitigate the vulnerability using one of the following methods:

##### 1. Use a Validating Interceptor (Recommended Mitigation)
Add an "outermost" interceptor to your server that validates the path before any other authorization logic runs:

```go
func pathValidationInterceptor(ctx context.Context, req any, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (any, error) {
    if info.FullMethod == "" || info.FullMethod[0] != '/' {
        return nil, status.Errorf(codes.Unimplemented, "malformed method name")
    }
    return handler(ctx, req)
}

// Ensure this is the FIRST interceptor in your chain
s := grpc.NewServer(
    grpc.ChainUnaryInterceptor(pathValidationInterceptor, authzInterceptor),
)
```

##### 2. Infrastructure-Level Normalization
If your gRPC server is behind a reverse proxy or load balancer (such as Envoy, NGINX, or an L7 Cloud Load Balancer), ensure it is configured to enforce strict HTTP/2 compliance for pseudo-headers and reject or normalize requests where the `:path` header does not start with a leading slash.

##### 3. Policy Hardening
Switch to a "default deny" posture in your authorization policies (explicitly listing all allowed paths and denying everything else) to reduce the risk of bypasses via malformed inputs.

#### Severity
- CVSS Score: 9.1 / 10 (Critical)
- Vector String: `CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:N`

#### References
- [https://github.com/grpc/grpc-go/security/advisories/GHSA-p77j-4mvh-x3m3](https://github.com/grpc/grpc-go/security/advisories/GHSA-p77j-4mvh-x3m3)
- [https://nvd.nist.gov/vuln/detail/CVE-2026-33186](https://nvd.nist.gov/vuln/detail/CVE-2026-33186)
- [https://github.com/grpc/grpc-go](https://github.com/grpc/grpc-go)

This data is provided by [OSV](https://osv.dev/vulnerability/GHSA-p77j-4mvh-x3m3) and the [GitHub Advisory Database](https://github.com/github/advisory-database) ([CC-BY 4.0](https://github.com/github/advisory-database/blob/main/LICENSE.md)).
</details>

---

### Authorization bypass in gRPC-Go via missing leading slash in :path in google.golang.org/grpc
[CVE-2026-33186](https://nvd.nist.gov/vuln/detail/CVE-2026-33186) / [GHSA-p77j-4mvh-x3m3](GHSA-p77j-4mvh-x3m3) / [GO-2026-4762](https://pkg.go.dev/vuln/GO-2026-4762)

<details>
<summary>More information</summary>

#### Details
Authorization bypass in gRPC-Go via missing leading slash in :path in google.golang.org/grpc

#### Severity
Unknown

#### References
- [https://github.com/grpc/grpc-go/security/advisories/GHSA-p77j-4mvh-x3m3](https://github.com/grpc/grpc-go/security/advisories/GHSA-p77j-4mvh-x3m3)

This data is provided by [OSV](https://osv.dev/vulnerability/GO-2026-4762) and the [Go Vulnerability Database](https://github.com/golang/vulndb) ([CC-BY 4.0](https://github.com/golang/vulndb#license)).
</details>

---

### Release Notes

<details>
<summary>grpc/grpc-go (google.golang.org/grpc)</summary>

### [`v1.79.3`](https://github.com/grpc/grpc-go/releases/tag/v1.79.3): Release 1.79.3

[Compare Source](grpc/grpc-go@v1.79.2...v1.79.3)

### Security

- server: fix an authorization bypass where malformed :path headers (missing the leading slash) could bypass path-based restricted "deny" rules in interceptors like `grpc/authz`. Any request with a non-canonical path is now immediately rejected with an `Unimplemented` error. ([#&#8203;8981](grpc/grpc-go#8981))

### [`v1.79.2`](https://github.com/grpc/grpc-go/releases/tag/v1.79.2): Release 1.79.2

[Compare Source](grpc/grpc-go@v1.79.1...v1.79.2)

### Bug Fixes

- stats: Prevent redundant error logging in health/ORCA producers by skipping stats/tracing processing when no stats handler is configured. ([#&#8203;8874](grpc/grpc-go#8874))

### [`v1.79.1`](https://github.com/grpc/grpc-go/releases/tag/v1.79.1): Release 1.79.1

[Compare Source](grpc/grpc-go@v1.79.0...v1.79.1)

### Bug Fixes

- grpc: Remove the `-dev` suffix from the User-Agent header. ([#&#8203;8902](grpc/grpc-go#8902))

### [`v1.79.0`](https://github.com/grpc/grpc-go/releases/tag/v1.79.0): Release 1.79.0

[Compare Source](grpc/grpc-go@v1.78.0...v1.79.0)

### API Changes

- mem: Add experimental API `SetDefaultBufferPool` to change the default buffer pool. ([#&#8203;8806](grpc/grpc-go#8806))
  - Special Thanks: [@&#8203;vanja-p](https://github.com/vanja-p)
- experimental/stats: Update `MetricsRecorder` to require embedding the new `UnimplementedMetricsRecorder` (a no-op struct) in all implementations for forward compatibility. ([#&#8203;8780](grpc/grpc-go#8780))

### Behavior Changes

- balancer/weightedtarget: Remove handling of `Addresses` and only handle `Endpoints` in resolver updates. ([#&#8203;8841](grpc/grpc-go#8841))

### New Features

- experimental/stats: Add support for asynchronous gauge metrics through the new `AsyncMetricReporter` and `RegisterAsyncReporter` APIs. ([#&#8203;8780](grpc/grpc-go#8780))
- pickfirst: Add support for weighted random shuffling of endpoints, as described in [gRFC A113](grpc/proposal#535).
  - This is enabled by default, and can be turned off using the environment variable `GRPC_EXPERIMENTAL_PF_WEIGHTED_SHUFFLING`. ([#&#8203;8864](grpc/grpc-go#8864))
- xds: Implement `:authority` rewriting, as specified in [gRFC A81](https://github.com/grpc/proposal/blob/master/A81-xds-authority-rewriting.md). ([#&#8203;8779](grpc/grpc-go#8779))
- balancer/randomsubsetting: Implement the `random_subsetting` LB policy, as specified in [gRFC A68](https://github.com/grpc/proposal/blob/master/A68-random-subsetting.md). ([#&#8203;8650](grpc/grpc-go#8650))
  - Special Thanks: [@&#8203;marek-szews](https://github.com/marek-szews)

### Bug Fixes

- credentials/tls: Fix a bug where the port was not stripped from the authority override before validation. ([#&#8203;8726](grpc/grpc-go#8726))
  - Special Thanks: [@&#8203;Atul1710](https://github.com/Atul1710)
- xds/priority: Fix a bug causing delayed failover to lower-priority clusters when a higher-priority cluster is stuck in `CONNECTING` state. ([#&#8203;8813](grpc/grpc-go#8813))
- health: Fix a bug where health checks failed for clients using legacy compression options (`WithDecompressor` or `RPCDecompressor`). ([#&#8203;8765](grpc/grpc-go#8765))
  - Special Thanks: [@&#8203;sanki92](https://github.com/sanki92)
- transport: Fix an issue where the HTTP/2 server could skip header size checks when terminating a stream early. ([#&#8203;8769](grpc/grpc-go#8769))
  - Special Thanks: [@&#8203;joybestourous](https://github.com/joybestourous)
- server: Propagate status detail headers, if available, when terminating a stream during request header processing. ([#&#8203;8754](grpc/grpc-go#8754))
  - Special Thanks: [@&#8203;joybestourous](https://github.com/joybestourous)

### Performance Improvements

- credentials/alts: Optimize read buffer alignment to reduce copies. ([#&#8203;8791](grpc/grpc-go#8791))
- mem: Optimize pooling and creation of `buffer` objects.  ([#&#8203;8784](grpc/grpc-go#8784))
- transport: Reduce slice re-allocations by reserving slice capacity. ([#&#8203;8797](grpc/grpc-go#8797))

### [`v1.78.0`](https://github.com/grpc/grpc-go/releases/tag/v1.78.0): Release 1.78.0

[Compare Source](grpc/grpc-go@v1.77.0...v1.78.0)

### Behavior Changes

- client: Align URL validation with Go 1.26+ to now reject target URLs with unbracketed colons in the hostname. ([#&#8203;8716](grpc/grpc-go#8716))
  - Special Thanks: [@&#8203;neild](https://github.com/neild)
- transport/client : Return status code `Unknown` on malformed grpc-status. ([#&#8203;8735](grpc/grpc-go#8735))
- - xds/resolver:
  - Drop previous route resources and report an error when no matching virtual host is found.
  - Only log LDS/RDS configuration errors following a successful update and retain the last valid resource to prevent transient failures. ([#&#8203;8711](grpc/grpc-go#8711))

### New Features

- stats/otel: Add backend service label to weighted round robin metrics as part of A89. ([#&#8203;8737](grpc/grpc-go#8737))
- stats/otel: Add subchannel metrics (without the disconnection reason) to eventually replace the pickfirst metrics. ([#&#8203;8738](grpc/grpc-go#8738))
- client: Wait for all pending goroutines to complete when closing a graceful switch balancer. ([#&#8203;8746](grpc/grpc-go#8746))
  - Special Thanks: [@&#8203;twz123](https://github.com/twz123)
- client: Add `experimental.AcceptCompressors` so callers can restrict the `grpc-accept-encoding` header advertised for a call. ([#&#8203;8718](grpc/grpc-go#8718))
  - Special Thanks: [@&#8203;iblancasa](https://github.com/iblancasa)

### Bug Fixes

- xds: Fix a bug in `StringMatcher` where regexes would match incorrectly when ignore\_case is set to true. ([#&#8203;8723](grpc/grpc-go#8723))
- client:
  - Change connectivity state to CONNECTING when creating the name resolver (as part of exiting IDLE).
  - Change connectivity state to TRANSIENT\_FAILURE if name resolver creation fails (as part of exiting IDLE).
  - Change connectivity state to IDLE after idle timeout expires even when current state is TRANSIENT\_FAILURE.
  - Fix a bug that resulted in `OnFinish` call option not being invoked for RPCs where stream creation failed. ([#&#8203;8710](grpc/grpc-go#8710))
- xdsclient: Fix a race in the xdsClient that could lead to resource-not-found errors. ([#&#8203;8627](grpc/grpc-go#8627))

### Performance Improvements

- mem: Round up to nearest 4KiB for pool allocations larger than 1MiB. ([#&#8203;8705](grpc/grpc-go#8705))
  - Special Thanks: [@&#8203;cjc25](https://github.com/cjc25)

### [`v1.77.0`](https://github.com/grpc/grpc-go/releases/tag/v1.77.0): Release 1.77.0

[Compare Source](grpc/grpc-go@v1.76.0...v1.77.0)

### API Changes

- mem: Replace the `Reader` interface with a struct for better performance and maintainability. ([#&#8203;8669](grpc/grpc-go#8669))

### Behavior Changes

- balancer/pickfirst: Remove support for the old `pick_first` LB policy via the environment variable `GRPC_EXPERIMENTAL_ENABLE_NEW_PICK_FIRST=false`. The new `pick_first` has been the default since `v1.71.0`. ([#&#8203;8672](grpc/grpc-go#8672))

### Bug Fixes

- xdsclient: Fix a race condition in the ADS stream implementation that could result in `resource-not-found` errors, causing the gRPC client channel to move to `TransientFailure`. ([#&#8203;8605](grpc/grpc-go#8605))
- client: Ignore HTTP status header for gRPC streams. ([#&#8203;8548](grpc/grpc-go#8548))
- client: Set a read deadline when closing a transport to prevent it from blocking indefinitely on a broken connection. ([#&#8203;8534](grpc/grpc-go#8534))
  - Special Thanks: [@&#8203;jgold2-stripe](https://github.com/jgold2-stripe)
- client: Fix a bug where default port 443 was not automatically added to addresses without a specified port when sent to a proxy.
  - Setting environment variable `GRPC_EXPERIMENTAL_ENABLE_DEFAULT_PORT_FOR_PROXY_TARGET=false` disables this change; please file a bug if any problems are encountered as we will remove this option soon. ([#&#8203;8613](grpc/grpc-go#8613))
- balancer/pickfirst: Fix a bug where duplicate addresses were not being ignored as intended. ([#&#8203;8611](grpc/grpc-go#8611))
- server: Fix a bug that caused overcounting of channelz metrics for successful and failed streams. ([#&#8203;8573](grpc/grpc-go#8573))
  - Special Thanks: [@&#8203;hugehoo](https://github.com/hugehoo)
- balancer/pickfirst: When configured, shuffle addresses in resolver updates that lack endpoints. Since gRPC automatically adds endpoints to resolver updates, this bug only affects custom LB policies that delegate to `pick_first` but don't set endpoints. ([#&#8203;8610](grpc/grpc-go#8610))
- mem: Clear large buffers before re-using. ([#&#8203;8670](grpc/grpc-go#8670))

### Performance Improvements

- transport: Reduce heap allocations to reduce time spent in garbage collection. ([#&#8203;8624](grpc/grpc-go#8624), [#&#8203;8630](grpc/grpc-go#8630), [#&#8203;8639](grpc/grpc-go#8639), [#&#8203;8668](grpc/grpc-go#8668))
- transport: Avoid copies when reading and writing Data frames. ([#&#8203;8657](grpc/grpc-go#8657), [#&#8203;8667](grpc/grpc-go#8667))
- mem: Avoid clearing newly allocated buffers. ([#&#8203;8670](grpc/grpc-go#8670))

### New Features

- outlierdetection: Add metrics specified in [gRFC A91](https://github.com/grpc/proposal/blob/master/A91-outlier-detection-metrics.md). ([#&#8203;8644](grpc/grpc-go#8644))
  - Special Thanks: [@&#8203;davinci26](https://github.com/davinci26), [@&#8203;PardhuKonakanchi](https://github.com/PardhuKonakanchi)
- stats/opentelemetry: Add support for optional label `grpc.lb.backend_service` in per-call metrics ([#&#8203;8637](grpc/grpc-go#8637))
- xds: Add support for JWT Call Credentials as specified in [gRFC A97](https://github.com/grpc/proposal/blob/master/A97-xds-jwt-call-creds.md). Set environment variable `GRPC_EXPERIMENTAL_XDS_BOOTSTRAP_CALL_CREDS=true` to enable this feature. ([#&#8203;8536](grpc/grpc-go#8536))
  - Special Thanks: [@&#8203;dimpavloff](https://github.com/dimpavloff)
- experimental/stats: Add support for up/down counters. ([#&#8203;8581](grpc/grpc-go#8581))

### [`v1.76.0`](https://github.com/grpc/grpc-go/releases/tag/v1.76.0): Release 1.76.0

[Compare Source](grpc/grpc-go@v1.75.1...v1.76.0)

### Dependencies

- Minimum supported Go version is now 1.24 ([#&#8203;8509](grpc/grpc-go#8509))
  - Special Thanks: [@&#8203;kevinGC](https://github.com/kevinGC)

### Bug Fixes

- client: Return status `INTERNAL` when a server sends zero response messages for a unary or client-streaming RPC. ([#&#8203;8523](grpc/grpc-go#8523))
- client: Fail RPCs with status `INTERNAL` instead of `UNKNOWN` upon receiving http headers with status 1xx and  `END_STREAM` flag set. ([#&#8203;8518](grpc/grpc-go#8518))
  - Special Thanks: [@&#8203;vinothkumarr227](https://github.com/vinothkumarr227)
- pick\_first: Fix race condition that could cause pick\_first to get stuck in `IDLE` state on backend address change. ([#&#8203;8615](grpc/grpc-go#8615))

### New Features

- credentials: Add `credentials/jwt` package providing file-based JWT PerRPCCredentials (A97). ([#&#8203;8431](grpc/grpc-go#8431))
  - Special Thanks: [@&#8203;dimpavloff](https://github.com/dimpavloff)

### Performance Improvements

- client: Improve HTTP/2 header size estimate to reduce re-allocations. ([#&#8203;8547](grpc/grpc-go#8547))
- encoding/proto: Avoid redundant message size calculation when marshaling. ([#&#8203;8569](grpc/grpc-go#8569))
  - Special Thanks: [@&#8203;rs-unity](https://github.com/rs-unity)

### [`v1.75.1`](https://github.com/grpc/grpc-go/releases/tag/v1.75.1): Release 1.75.1

[Compare Source](grpc/grpc-go@v1.75.0...v1.75.1)

### Bug Fixes

- transport: Fix a data race while copying headers for stats handlers in the std lib http2 server transport. ([#&#8203;8519](grpc/grpc-go#8519))
- xdsclient:
  - Fix a data race caused while reporting load to LRS. ([#&#8203;8483](grpc/grpc-go#8483))
  - Fix regression preventing empty node IDs when creating an LRS client. ([#&#8203;8483](grpc/grpc-go#8483))
- server: Fix a regression preventing streams from being cancelled or timed out when blocked on flow control. ([#&#8203;8528](grpc/grpc-go#8528))

</details>

---

### Configuration

📅 **Schedule**: (UTC)

- Branch creation
  - ""
- Automerge
  - Between 12:00 AM and 03:59 AM (`* 0-3 * * *`)

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Mend Renovate](https://github.com/renovatebot/renovate).
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiI0My4xOTUuMSIsInVwZGF0ZWRJblZlciI6IjQzLjE5NS4xIiwidGFyZ2V0QnJhbmNoIjoiZm9yZ2VqbyIsImxhYmVscyI6WyJkZXBlbmRlbmN5LXVwZ3JhZGUiLCJ0ZXN0L25vdC1uZWVkZWQiXX0=-->

Reviewed-on: https://codeberg.org/forgejo/forgejo/pulls/12794
Reviewed-by: Mathieu Fenniak <mfenniak@noreply.codeberg.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area: Client Includes Channel/Subchannel/Streams, Connectivity States, RPC Retries, Dial/Call Options and more. Type: Bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add a test for the initial channel state while waiting for the first name resolver update

3 participants