Skip to content

rls: update rls cache metrics to use async gauge framework#8808

Merged
mbissa merged 65 commits into
grpc:masterfrom
mbissa:move-rls-cache-metrics-to-async-framework
Mar 11, 2026
Merged

rls: update rls cache metrics to use async gauge framework#8808
mbissa merged 65 commits into
grpc:masterfrom
mbissa:move-rls-cache-metrics-to-async-framework

Conversation

@mbissa

@mbissa mbissa commented Jan 7, 2026

Copy link
Copy Markdown
Contributor

This PR changes the way rls cache metrics ("grpc.lb.rls.cache_entries" & "grpc.lb.rls.cache_size") to use async gauge framework instead of synchronoush capturing.
This also does cleanup to remove the NoopMetricsRecorder struct along with addressing some old nits.

RELEASE NOTES:

  • rls: Change rls cache metrics ("grpc.lb.rls.cache_entries" & "grpc.lb.rls.cache_size") to be recorded asynchronously (i.e the metric will be recorded once per collection cycle, rather than every time its value changes) going forward.

@mbissa mbissa added this to the 1.79 Release milestone Jan 7, 2026
@mbissa mbissa requested a review from arjan-bal January 7, 2026 02:27
@mbissa mbissa self-assigned this Jan 7, 2026
@mbissa mbissa added the Type: Feature New features or improvements in behavior label Jan 7, 2026
@codecov

codecov Bot commented Jan 7, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 64.86486% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.25%. Comparing base (fd53961) to head (c12db10).
⚠️ Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
internal/testutils/stats/test_metrics_recorder.go 0.00% 11 Missing ⚠️
balancer/rls/balancer.go 88.88% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8808      +/-   ##
==========================================
- Coverage   83.26%   83.25%   -0.01%     
==========================================
  Files         410      410              
  Lines       32576    32598      +22     
==========================================
+ Hits        27123    27139      +16     
- Misses       4062     4066       +4     
- Partials     1391     1393       +2     
Files with missing lines Coverage Δ
balancer/rls/cache.go 88.39% <100.00%> (-0.60%) ⬇️
balancer/rls/balancer.go 86.55% <88.88%> (+0.09%) ⬆️
internal/testutils/stats/test_metrics_recorder.go 77.85% <0.00%> (-6.21%) ⬇️

... and 23 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@arjan-bal arjan-bal left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can clarify the meaning of asynchronous in the release notes by adding something like: "The metric is recorded once per collection cycle, rather than every time its value changes."

Comment thread balancer/rls/balancer.go Outdated
Comment thread balancer/rls/balancer.go Outdated
Comment thread balancer/rls/balancer.go Outdated
Comment thread balancer/rls/balancer.go Outdated
Comment thread balancer/rls/cache_test.go Outdated
Comment thread balancer/rls/cache_test.go Outdated
Comment thread balancer/rls/cache_test.go Outdated
@arjan-bal arjan-bal removed their assignment Jan 7, 2026
@arjan-bal arjan-bal added the Area: Observability Includes Stats, Tracing, Channelz, Healthz, Binlog, Reflection, Admin, GCP Observability label Jan 7, 2026
Comment thread balancer/rls/balancer.go Outdated
eshitachandwani and others added 17 commits February 16, 2026 06:36
…grpc#8793)

This is Part of A74 changes.

This PR add a new `IsDynamic` field to the CDS balancer LB Config. Also
sets it to true for RLS cluster specifier plugin.

This PR also fixes the test in rls_test.go which earlier was returning
if the test has `wantError=true` effectively not testing the cases after
that.

This will be used to dynamically start watch for cluster specifier
plugin clusters from cds balancer.

RELEASE NOTES: None
The change sets an initial estimate for the `mem.BufferSlice` capacity
while reading messages to avoid reallocations for messages larger than
16KB.

## Benchmarks
```sh
$ go run benchmark/benchresult/main.go streaming-before streaming-after                
streaming-networkMode_Local-bufConn_true-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentC
alls_120-reqSize_16500B-respSize_16500B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteB
ufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_simple-shar
edWriteBuffer_false
               Title       Before        After Percentage
            TotalOps      5040044      5024089    -0.32%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     75499.82     75446.71    -0.07%
           Allocs/op        26.24        24.22    -7.62%
             ReqT/op 11088096800.00 11052995800.00    -0.32%
            RespT/op 11088096800.00 11052995800.00    -0.32%
            50th-Lat   1.215848ms    1.22471ms     0.73%
            90th-Lat   2.258081ms    2.25736ms    -0.03%
            99th-Lat    2.78024ms   2.756574ms    -0.85%
             Avg-Lat   1.427389ms   1.431989ms     0.32%
           GoVersion     go1.24.8     go1.24.8
         GrpcVersion   1.79.0-dev   1.79.0-dev
```

RELEASE NOTES:
* transport: Reduce slice re-allocations by reserving slice capacity.
grpc#7378 added an internal API to propagate the locality ID of each address
used by a subchannel. This was required because the legacy pick_first
implementation created subchannels with multiple addresses, each with a
potentially different locality. Since the legacy implementation has been
removed, and the new pick_first creates subchannels with only a single
address, this workaround is now redundant. This PR also make the
`UpdateAddresses` method of `clusterImpl` balancer a no-op since none of
the leaf policies use it anymore.

RELEASE NOTES: N/A
This change migrates usages of ResolverState.Addresses to
ResolverState.Endpoints. Prior to Dualstack support, xDS LB policies
relied exclusively on the `Addresses` field. As part of the Dualstack
initiative, these policies were updated to support both `Endpoints` and
`Addresses`. This PR is the final step to deprecate the `Addresses`
propagation logic in favor of relying solely on `Endpoints`.

RELEASE NOTES: N/A
When a connection attempt is canceled after the transport is created,
ensure that it is closed inline. This prevents untracked goroutines from
being left behind.

See:

* grpc#8655
* grpc#8666 (comment)

RELEASE NOTES:
* When canceling a connection attempt, closing the transport won't use
an untracked goroutine anymore.

Signed-off-by: Tom Wieczorek <twieczorek@mirantis.com>
… CONNECTING to CONNECTING (grpc#8813)

Fixes grpc#8516

This PR ensures that the init timer is not restarted when a child policy
transitions from CONNECTING to CONNECTING. This ensures that we will
only ever wait for `DefaultPriorityInitTimeout` which is set to a value
of `10s` for a given child to become `Ready` before we attempt a
failover to the next highest priority child.

RELEASE NOTES:
- xds/priority: Fixed a bug causing delayed failover to lower-priority
clusters when a higher-priority cluster is stuck in the CONNECTING
state.
See:

* grpc#8666 (comment)

RELEASE NOTES: none

Signed-off-by: Tom Wieczorek <twieczorek@mirantis.com>
stream: initialize decompressor for health check streams

Fixes grpc#8162

Health checks fail when using legacy compression options
(WithDecompressor on client or RPCDecompressor on server). The health
check stream created via newNonRetryClientStream does not initialize the
decompressorV0 field with the legacy decompressor from dial options,
causing an Internal error when attempting to decompress server
responses.

This fix initializes the decompressorV0 field in addrConnStream with
ac.cc.dopts.dc, matching the behavior of the regular clientStream path
(line 492) to ensure health check streams can properly decompress
responses when legacy compression options are configured.

RELEASE NOTES:
* health: Fixed health checks failing when using legacy compression
options WithDecompressor or RPCDecompressor.
…d correct resources (grpc#8794)

This PR fixes the `TestErrorFromParentLB_ResourceNotFound` to
reconfigure the same CDS and EDS resources after they have been removed
instead of configuring all the new resources.

Also removes unused `defaultTestShortTimeout` from
[internal/xds/balancer/clustermanager/clustermanager_test.go](https://github.com/grpc/grpc-go/blob/900ffa948c031ddd50a667750f7cba8f01c66c4d/internal/xds/balancer/clustermanager/clustermanager_test.go#L51)
RELEASE NOTES: None
…orting stream (grpc#8769)

Modifies `earlyAbortStreamHandler` to include check the header list size
when early aborting, and returns a RST_STREAM if the max size is
exceeded.

Fixes grpc#8766

RELEASE NOTES:
* transport: http2 server now validates header list size when early
aborting stream
Use new-style atomic APIs instead of the old ones in the WRR scheduler
type.

The changes made in this PR improve the code in the following ways:

- Ergonomics: Method-based API vs function-based, no pointer management
needed
- Safety: Type safety prevents mixing atomic/non-atomic operations,
eliminates pointer errors
- Clarity: The atomic.Uint32 type makes atomic intent explicit from
declaration

RELEASE NOTES: none
[Splitting a
`buffer`](https://github.com/grpc/grpc-go/blob/40466769682557e7179b8c74ba3820cc78d49b4b/mem/buffers.go#L172-L187)
results in fetching a new `buffer` object from a `sync.Pool`. The
`buffer` object is returned back to the pool only once [the shared ref
count falls to
0](https://github.com/grpc/grpc-go/blob/40466769682557e7179b8c74ba3820cc78d49b4b/mem/buffers.go#L152-L155).
As a result, only one of the `buffer` objects is returned back to the
pool for re-use. The "leaked" buffer objects may cause noticeable
allocations when buffers are split more frequently. I noticed this when
[attempting to remove a buffer
copy](https://github.com/grpc/grpc-go/compare/master...arjan-bal:zero-copy-buf-reader?expand=1)
by replacing the bufio.Reader.

## Solution
This PR introduces a root-owner model for the underlying `*[]byte`
within `buffer` objects. The root object manages the slice's lifecycle,
returning it to the pool only when its reference count reaches zero.

When a `buffer` is split, the new `buffer` is treated as a child,
incrementing the ref counts for both itself and the root. Once a child’s
ref count hits zero, it returns itself to the pool and decrements the
root’s count.

Additionally, this PR removes the `sync.Pool` used for `*atomic.Int32`
by embedding `atomic.Int32` as a value field within the `buffer` struct.
By eliminating the second pool and the associated pointer indirection,
we reduce allocation overhead and improve cache locality during buffer
lifecycle events.

## Benchmarks

A micro-benchmark showing the buffer object leak:
```go
func BenchmarkSplit(b *testing.B) {
	pool := mem.DefaultBufferPool()
	b.Run("split", func(b *testing.B) {
		for b.Loop() {
			size := 1 << 15 // 32 KB
			slice := pool.Get(size)
			buf := mem.NewBuffer(slice, pool)
			left, right := mem.SplitUnsafe(buf, size/2)
			left.Free()
			right.Free()
		}
	})
	b.Run("no-split", func(b *testing.B) {
		for b.Loop() {
			size := 1 << 15 // 32 KB
			slice := pool.Get(size)
			buf := mem.NewBuffer(slice, pool)
			buf.Free()
		}
	})
}
```
Result on master vs this PR.
```sh
goos: linux
goarch: amd64
pkg: google.golang.org/grpc/mem
cpu: Intel(R) Xeon(R) CPU @ 2.60GHz
                  │   old.txt   │               new.txt               │
                  │   sec/op    │   sec/op     vs base                │
Split/split-48      418.2n ± 0%   263.9n ± 1%  -36.89% (p=0.000 n=10)
Split/no-split-48   221.1n ± 1%   208.5n ± 0%   -5.70% (p=0.000 n=10)
geomean             304.1n        234.6n       -22.86%

                  │   old.txt    │                 new.txt                 │
                  │     B/op     │    B/op     vs base                     │
Split/split-48      64.00 ± 0%      0.00 ± 0%  -100.00% (p=0.000 n=10)
Split/no-split-48   0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
geomean                        ²               ?                       ² ³
¹ all samples are equal
² summaries must be >0 to compute geomean
³ ratios must be >0 to compute geomean

                  │   old.txt    │                 new.txt                 │
                  │  allocs/op   │ allocs/op   vs base                     │
Split/split-48      1.000 ± 0%     0.000 ± 0%  -100.00% (p=0.000 n=10)
Split/no-split-48   0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
geomean                        ²               ?                       ² ³
¹ all samples are equal
² summaries must be >0 to compute geomean
³ ratios must be >0 to compute geomean
```

The effect on local gRPC benchmarks is negligible since the
`SplitUnsafe` function isn't called very frequently.
```sh
$ go run benchmark/benchresult/main.go unary-before unary-after       
unary-networkMode_Local-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurr
entCalls_120-reqSize_16000B-respSize_16000B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-c
lientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBuff
erPool_simple-sharedWriteBuffer_false
               Title       Before        After Percentage
            TotalOps      2985694      3024364     1.30%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     74784.94     74784.99     0.00%
           Allocs/op       133.67       133.89     0.00%
             ReqT/op 6369480533.33 6451976533.33     1.30%
            RespT/op 6369480533.33 6451976533.33     1.30%
            50th-Lat   2.410033ms    2.40116ms    -0.37%
            90th-Lat   3.145118ms   3.081771ms    -2.01%
            99th-Lat   3.563055ms   3.629663ms     1.87%
             Avg-Lat   2.410529ms   2.379513ms    -1.29%
           GoVersion     go1.24.8     go1.24.8
         GrpcVersion   1.78.0-dev   1.78.0-dev
```

RELEASE NOTES:
* mem: Improve pooling of `buffer` objects on using `SplitUnsafe`.
Following the dual-stack changes that established `pick_first` as the
universal leaf policy and the subsequent removal of the legacy
`pick_first` implementation, `ClientConn.UpdateAddresses` is no longer
utilized within the xDS LB policy tree. When addresses change, old
`SubConns` are now shut down and new ones are created.

This PR makes `UpdateAddresses` a no-op and removes the corresponding
test. Additionally, it ensures the `subConnWrapper` is removed from the
ejection/unejection map when a SubConn closes to prevent memory leaks.


RELEASE NOTES: N/A
…ilure (grpc#8805)

Fixes: grpc#8801

There were 2 problems here : 
1. The actual DNS lookup calls return different resolved addresses for
"localhost" on different machines. Changed the code to replace the DNS
resolver with a manual resolver in test to mock the DNS resolver.

2. In the case where the tests send a EDS or cluster error , the xds
management server keeps sending resource error continuously which in
turn triggers updates from dependency manager which pushes the [update
to a
channel.](https://github.com/grpc/grpc-go/blob/bccbb10454782f18ecdfd60ca39ef413357edb0b/internal/xds/xdsdepmgr/xds_dependency_manager_test.go#L97)
To mitigate this situation we had a done channel , that we [close when
the test
ends](https://github.com/grpc/grpc-go/blob/bccbb10454782f18ecdfd60ca39ef413357edb0b/internal/xds/xdsdepmgr/xds_dependency_manager_test.go#L1273)
and the update function
[check](https://github.com/grpc/grpc-go/blob/bccbb10454782f18ecdfd60ca39ef413357edb0b/internal/xds/xdsdepmgr/xds_dependency_manager_test.go#L95)
for the done channel , if it is close, it returns. In
[TestAggregateCLusterChildError](https://github.com/grpc/grpc-go/blob/bccbb10454782f18ecdfd60ca39ef413357edb0b/internal/xds/xdsdepmgr/xds_dependency_manager_test.go#L1170)
test , (and other tests too) , the channel was being closed at the end
of the test , which meant it will close only when the test passes. And
if the tests fails, the done channel was not getting closed , which made
the update function to block.
Changed the close(done channel) to be a defer function , but this should
be the first thing that happens after the test is done , before the
dependency manager closes, so this is the last defer called in the test.

Also fixes `TestRouteConfigResourceError` flaking due to a similar
error.

RELEASE NOTES: None
Addresses grpc#1824

This PR adds a test for the case where the client specifies a codec that
is unsupported on the server. In a previous PR, we had added a warning
for this scenario. In a subsequent PR, we will add an environment
variable to gate the new behavior, where the server will not default to
using the `proto` codec for unsupported codecs, but instead will fail
the request.

RELEASE NOTES: None
This PR remove the `grpchttp2` package since we no longer want to
implement a custom http2 framer in grpc.

We originally planned a custom HTTP/2 framer to get rid of internal
copies. However, we have updated gRPC to control data frame I/O
([grpc#8667](grpc#8667),
[grpc#8657](grpc#8657)) and will have
changes that removes the bufio.Reader copy. This removes the need for a
custom framer.

RELEASE NOTES: None
arjan-bal and others added 9 commits February 16, 2026 06:36
…rpc#8818)

This adds e2e style tests to verify that when JWT call credentials
(which require transport security) are used with an insecure transport,
the RPC fails with a meaningful error as expected per gRFC A97.

Fixes grpc#8635

RELEASE NOTES: none

---------

Signed-off-by: iamrajiv <rajivperfect007@gmail.com>
If there was an error, e.g. while decompressing the data, we still want
to return all buffers to the pool.

RELEASE NOTES: None
Fixes: grpc#8883

This change doubles the timeout for `EDS_MultipleLocalities` as the test
sends around 5000 RPCs and may time out on slow machines.

RELEASE NOTES: N/A
…ues (grpc#8867)

It's nicer than iterating over the keys and looking up their values.

RELEASE NOTES:
* resolver: Add `All()` methods to `AddressMap` and `EndpointMap` that
return iterators for efficient key-value pair traversal.
While working on xDS HTTP filter state retention changes required for
A83 and A86 in grpc#8745, I realized
that the current implementation of HTTP filters on the server-side does
not lend itself well to the changes we want to make to support filter
state retention. The crux of the problem in the current implementation
is as follows:
- The `xdsresource` package builds a `FilterChainManager` type that does
the following:
- Keeps track of the list of Route Configuration resources to be
requested based on the HTTP filters configured on the Listener resource
- Updates the interceptors on the different filter chains based on Route
Configuration resource updates (that contain filter config overrides)
- Maintaining runtime state in a type that is supposed to represent a
configuration blob seems to have been a wrong implementation choice
- Everytime a Listener resource update is received, the `xdsresource`
package builds a brand new `FilterChainManager`
- This means that we *cannot* retain filter state across resource
updates

This PR makes the following changes:
- Changes the layout of the `ListenerUpdate` struct to clearly separate
client-side and server-side fields
- Also groups information that is shared across client and server
listener types into a logical struct
- The `ListenerUpdate` will *only* contain information directly sourced
from the listener proto
- The xDS server implementation will build the runtime representation of
the filter chains using the configuration update provided by the
`xdsresource` package
- This would move the responsibility of updating the interceptors on the
filter chain to the xDS server, which is the correct place to be doing
such a thing
- This would make it possible for the xDS server implementation to
retain filter state across listener resource updates

RELEASE NOTES: none
This PR increases the expiry of the CAs from 1 year to 10 years and
re-generates the certificates.

RELEASE NOTES: N/A
…rpc#8874)

This resolves an issue where below log statement keeps printing
repeatedly:
"ERROR: [otel-plugin] ctx passed into client side stats handler metrics
event handling has no client attempt data present".
When new stream is created for health/orca producers, stats and tracing
is not setup. However, this fact is ignored during RPC and an error logs
is printed to denote that stats cannot be handled. We will enable stream
to have its own reference to the stats handler and only process per RPC
implementation when it is present (like in case of regular data
streams).

Internal issue: b/385685802

RELEASE NOTES:
* stats: only process RPC stats/tracing in health and ORCA producers if
a handler is configured, preventing unnecessary error logging
@mbissa

mbissa commented Feb 16, 2026

Copy link
Copy Markdown
Contributor Author

merging master (rebasing due to staleness)

@mbissa mbissa force-pushed the move-rls-cache-metrics-to-async-framework branch from aa21c76 to b688f86 Compare February 16, 2026 08:57
@mbissa mbissa assigned arjan-bal and unassigned mbissa Feb 16, 2026
Comment thread balancer/rls/balancer.go Outdated
Comment thread balancer/rls/cache_test.go Outdated
Comment on lines +266 to +267
tmr.ClearMetrics()
dc.reportMetrics(tmr, dc.currentSize, int64(len(dc.entries)), dc.rlsServerTarget)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since reportMetrics is just a pass-through, this unit test adds little value. Let's move it to the RLS balancer tests: trigger actions -> verify metrics. If making that deterministic is too complex, let me know.

There are existing tests in metrics_test.go that may have some coverage already.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, removed the test here.
The tests in metrics_test.go verify the grpc.lb.rls.cache_entries and grpc.lb.rls.cache_size metrics in the context of the full balancer lifecycle.

@arjan-bal arjan-bal assigned mbissa and unassigned arjan-bal Feb 17, 2026
@mbissa mbissa modified the milestones: 1.80 Release, 1.81 Release Mar 6, 2026
@mbissa mbissa assigned arjan-bal and unassigned mbissa Mar 10, 2026

@arjan-bal arjan-bal left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@arjan-bal arjan-bal assigned mbissa and unassigned arjan-bal Mar 11, 2026
@arjan-bal

Copy link
Copy Markdown
Contributor

Please merge the master branch to unblock go1.26 CI workflows that were recently added.

@mbissa mbissa merged commit 98573cb into grpc:master Mar 11, 2026
14 checks passed
@ejona86

ejona86 commented Mar 11, 2026

Copy link
Copy Markdown
Member

It looks like the commit from this PR was busted, with all the Co-authored-by.
98573cb

Looking like merging/rebasing went wrong and someone didn't check when merging.

@mbissa

mbissa commented Mar 12, 2026

Copy link
Copy Markdown
Contributor Author

It looks like the commit from this PR was busted, with all the Co-authored-by. 98573cb

Looking like merging/rebasing went wrong and someone didn't check when merging.

There were many merge conflicts due to this PR not seeing activity in last 30-45 days. I thought merge commits are supposed to have this. I will ensure that this doesn't happen next time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area: Observability Includes Stats, Tracing, Channelz, Healthz, Binlog, Reflection, Admin, GCP Observability Type: Feature New features or improvements in behavior

Projects

None yet

Development

Successfully merging this pull request may close these issues.