rls: update rls cache metrics to use async gauge framework by mbissa · Pull Request #8808 · grpc/grpc-go

mbissa · 2026-01-07T02:27:43Z

This PR changes the way rls cache metrics ("grpc.lb.rls.cache_entries" & "grpc.lb.rls.cache_size") to use async gauge framework instead of synchronoush capturing.
This also does cleanup to remove the NoopMetricsRecorder struct along with addressing some old nits.

RELEASE NOTES:

rls: Change rls cache metrics ("grpc.lb.rls.cache_entries" & "grpc.lb.rls.cache_size") to be recorded asynchronously (i.e the metric will be recorded once per collection cycle, rather than every time its value changes) going forward.

codecov · 2026-01-07T02:31:14Z

Codecov Report

❌ Patch coverage is 64.86486% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.25%. Comparing base (fd53961) to head (c12db10).
⚠️ Report is 1 commits behind head on master.

Files with missing lines	Patch %	Lines
internal/testutils/stats/test_metrics_recorder.go	0.00%	11 Missing ⚠️
balancer/rls/balancer.go	88.88%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #8808      +/-   ##
==========================================
- Coverage   83.26%   83.25%   -0.01%     
==========================================
  Files         410      410              
  Lines       32576    32598      +22     
==========================================
+ Hits        27123    27139      +16     
- Misses       4062     4066       +4     
- Partials     1391     1393       +2

Files with missing lines	Coverage Δ
balancer/rls/cache.go	`88.39% <100.00%> (-0.60%)`	⬇️
balancer/rls/balancer.go	`86.55% <88.88%> (+0.09%)`	⬆️
internal/testutils/stats/test_metrics_recorder.go	`77.85% <0.00%> (-6.21%)`	⬇️

... and 23 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

arjan-bal

I think we can clarify the meaning of asynchronous in the release notes by adding something like: "The metric is recorded once per collection cycle, rather than every time its value changes."

…grpc#8793) This is Part of A74 changes. This PR add a new `IsDynamic` field to the CDS balancer LB Config. Also sets it to true for RLS cluster specifier plugin. This PR also fixes the test in rls_test.go which earlier was returning if the test has `wantError=true` effectively not testing the cases after that. This will be used to dynamically start watch for cluster specifier plugin clusters from cds balancer. RELEASE NOTES: None

The change sets an initial estimate for the `mem.BufferSlice` capacity while reading messages to avoid reallocations for messages larger than 16KB. ## Benchmarks ```sh $ go run benchmark/benchresult/main.go streaming-before streaming-after streaming-networkMode_Local-bufConn_true-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentC alls_120-reqSize_16500B-respSize_16500B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteB ufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_simple-shar edWriteBuffer_false Title Before After Percentage TotalOps 5040044 5024089 -0.32% SendOps 0 0 NaN% RecvOps 0 0 NaN% Bytes/op 75499.82 75446.71 -0.07% Allocs/op 26.24 24.22 -7.62% ReqT/op 11088096800.00 11052995800.00 -0.32% RespT/op 11088096800.00 11052995800.00 -0.32% 50th-Lat 1.215848ms 1.22471ms 0.73% 90th-Lat 2.258081ms 2.25736ms -0.03% 99th-Lat 2.78024ms 2.756574ms -0.85% Avg-Lat 1.427389ms 1.431989ms 0.32% GoVersion go1.24.8 go1.24.8 GrpcVersion 1.79.0-dev 1.79.0-dev ``` RELEASE NOTES: * transport: Reduce slice re-allocations by reserving slice capacity.

grpc#7378 added an internal API to propagate the locality ID of each address used by a subchannel. This was required because the legacy pick_first implementation created subchannels with multiple addresses, each with a potentially different locality. Since the legacy implementation has been removed, and the new pick_first creates subchannels with only a single address, this workaround is now redundant. This PR also make the `UpdateAddresses` method of `clusterImpl` balancer a no-op since none of the leaf policies use it anymore. RELEASE NOTES: N/A

This change migrates usages of ResolverState.Addresses to ResolverState.Endpoints. Prior to Dualstack support, xDS LB policies relied exclusively on the `Addresses` field. As part of the Dualstack initiative, these policies were updated to support both `Endpoints` and `Addresses`. This PR is the final step to deprecate the `Addresses` propagation logic in favor of relying solely on `Endpoints`. RELEASE NOTES: N/A

When a connection attempt is canceled after the transport is created, ensure that it is closed inline. This prevents untracked goroutines from being left behind. See: * grpc#8655 * grpc#8666 (comment) RELEASE NOTES: * When canceling a connection attempt, closing the transport won't use an untracked goroutine anymore. Signed-off-by: Tom Wieczorek <twieczorek@mirantis.com>

… CONNECTING to CONNECTING (grpc#8813) Fixes grpc#8516 This PR ensures that the init timer is not restarted when a child policy transitions from CONNECTING to CONNECTING. This ensures that we will only ever wait for `DefaultPriorityInitTimeout` which is set to a value of `10s` for a given child to become `Ready` before we attempt a failover to the next highest priority child. RELEASE NOTES: - xds/priority: Fixed a bug causing delayed failover to lower-priority clusters when a higher-priority cluster is stuck in the CONNECTING state.

See: * grpc#8666 (comment) RELEASE NOTES: none Signed-off-by: Tom Wieczorek <twieczorek@mirantis.com>

stream: initialize decompressor for health check streams Fixes grpc#8162 Health checks fail when using legacy compression options (WithDecompressor on client or RPCDecompressor on server). The health check stream created via newNonRetryClientStream does not initialize the decompressorV0 field with the legacy decompressor from dial options, causing an Internal error when attempting to decompress server responses. This fix initializes the decompressorV0 field in addrConnStream with ac.cc.dopts.dc, matching the behavior of the regular clientStream path (line 492) to ensure health check streams can properly decompress responses when legacy compression options are configured. RELEASE NOTES: * health: Fixed health checks failing when using legacy compression options WithDecompressor or RPCDecompressor.

…d correct resources (grpc#8794) This PR fixes the `TestErrorFromParentLB_ResourceNotFound` to reconfigure the same CDS and EDS resources after they have been removed instead of configuring all the new resources. Also removes unused `defaultTestShortTimeout` from [internal/xds/balancer/clustermanager/clustermanager_test.go](https://github.com/grpc/grpc-go/blob/900ffa948c031ddd50a667750f7cba8f01c66c4d/internal/xds/balancer/clustermanager/clustermanager_test.go#L51) RELEASE NOTES: None

…orting stream (grpc#8769) Modifies `earlyAbortStreamHandler` to include check the header list size when early aborting, and returns a RST_STREAM if the max size is exceeded. Fixes grpc#8766 RELEASE NOTES: * transport: http2 server now validates header list size when early aborting stream

Use new-style atomic APIs instead of the old ones in the WRR scheduler type. The changes made in this PR improve the code in the following ways: - Ergonomics: Method-based API vs function-based, no pointer management needed - Safety: Type safety prevents mixing atomic/non-atomic operations, eliminates pointer errors - Clarity: The atomic.Uint32 type makes atomic intent explicit from declaration RELEASE NOTES: none

[Splitting a `buffer`](https://github.com/grpc/grpc-go/blob/40466769682557e7179b8c74ba3820cc78d49b4b/mem/buffers.go#L172-L187) results in fetching a new `buffer` object from a `sync.Pool`. The `buffer` object is returned back to the pool only once [the shared ref count falls to 0](https://github.com/grpc/grpc-go/blob/40466769682557e7179b8c74ba3820cc78d49b4b/mem/buffers.go#L152-L155). As a result, only one of the `buffer` objects is returned back to the pool for re-use. The "leaked" buffer objects may cause noticeable allocations when buffers are split more frequently. I noticed this when [attempting to remove a buffer copy](https://github.com/grpc/grpc-go/compare/master...arjan-bal:zero-copy-buf-reader?expand=1) by replacing the bufio.Reader. ## Solution This PR introduces a root-owner model for the underlying `*[]byte` within `buffer` objects. The root object manages the slice's lifecycle, returning it to the pool only when its reference count reaches zero. When a `buffer` is split, the new `buffer` is treated as a child, incrementing the ref counts for both itself and the root. Once a child’s ref count hits zero, it returns itself to the pool and decrements the root’s count. Additionally, this PR removes the `sync.Pool` used for `*atomic.Int32` by embedding `atomic.Int32` as a value field within the `buffer` struct. By eliminating the second pool and the associated pointer indirection, we reduce allocation overhead and improve cache locality during buffer lifecycle events. ## Benchmarks A micro-benchmark showing the buffer object leak: ```go func BenchmarkSplit(b *testing.B) { pool := mem.DefaultBufferPool() b.Run("split", func(b *testing.B) { for b.Loop() { size := 1 << 15 // 32 KB slice := pool.Get(size) buf := mem.NewBuffer(slice, pool) left, right := mem.SplitUnsafe(buf, size/2) left.Free() right.Free() } }) b.Run("no-split", func(b *testing.B) { for b.Loop() { size := 1 << 15 // 32 KB slice := pool.Get(size) buf := mem.NewBuffer(slice, pool) buf.Free() } }) } ``` Result on master vs this PR. ```sh goos: linux goarch: amd64 pkg: google.golang.org/grpc/mem cpu: Intel(R) Xeon(R) CPU @ 2.60GHz │ old.txt │ new.txt │ │ sec/op │ sec/op vs base │ Split/split-48 418.2n ± 0% 263.9n ± 1% -36.89% (p=0.000 n=10) Split/no-split-48 221.1n ± 1% 208.5n ± 0% -5.70% (p=0.000 n=10) geomean 304.1n 234.6n -22.86% │ old.txt │ new.txt │ │ B/op │ B/op vs base │ Split/split-48 64.00 ± 0% 0.00 ± 0% -100.00% (p=0.000 n=10) Split/no-split-48 0.000 ± 0% 0.000 ± 0% ~ (p=1.000 n=10) ¹ geomean ² ? ² ³ ¹ all samples are equal ² summaries must be >0 to compute geomean ³ ratios must be >0 to compute geomean │ old.txt │ new.txt │ │ allocs/op │ allocs/op vs base │ Split/split-48 1.000 ± 0% 0.000 ± 0% -100.00% (p=0.000 n=10) Split/no-split-48 0.000 ± 0% 0.000 ± 0% ~ (p=1.000 n=10) ¹ geomean ² ? ² ³ ¹ all samples are equal ² summaries must be >0 to compute geomean ³ ratios must be >0 to compute geomean ``` The effect on local gRPC benchmarks is negligible since the `SplitUnsafe` function isn't called very frequently. ```sh $ go run benchmark/benchresult/main.go unary-before unary-after unary-networkMode_Local-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurr entCalls_120-reqSize_16000B-respSize_16000B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-c lientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBuff erPool_simple-sharedWriteBuffer_false Title Before After Percentage TotalOps 2985694 3024364 1.30% SendOps 0 0 NaN% RecvOps 0 0 NaN% Bytes/op 74784.94 74784.99 0.00% Allocs/op 133.67 133.89 0.00% ReqT/op 6369480533.33 6451976533.33 1.30% RespT/op 6369480533.33 6451976533.33 1.30% 50th-Lat 2.410033ms 2.40116ms -0.37% 90th-Lat 3.145118ms 3.081771ms -2.01% 99th-Lat 3.563055ms 3.629663ms 1.87% Avg-Lat 2.410529ms 2.379513ms -1.29% GoVersion go1.24.8 go1.24.8 GrpcVersion 1.78.0-dev 1.78.0-dev ``` RELEASE NOTES: * mem: Improve pooling of `buffer` objects on using `SplitUnsafe`.

Following the dual-stack changes that established `pick_first` as the universal leaf policy and the subsequent removal of the legacy `pick_first` implementation, `ClientConn.UpdateAddresses` is no longer utilized within the xDS LB policy tree. When addresses change, old `SubConns` are now shut down and new ones are created. This PR makes `UpdateAddresses` a no-op and removes the corresponding test. Additionally, it ensures the `subConnWrapper` is removed from the ejection/unejection map when a SubConn closes to prevent memory leaks. RELEASE NOTES: N/A

…ilure (grpc#8805) Fixes: grpc#8801 There were 2 problems here : 1. The actual DNS lookup calls return different resolved addresses for "localhost" on different machines. Changed the code to replace the DNS resolver with a manual resolver in test to mock the DNS resolver. 2. In the case where the tests send a EDS or cluster error , the xds management server keeps sending resource error continuously which in turn triggers updates from dependency manager which pushes the [update to a channel.](https://github.com/grpc/grpc-go/blob/bccbb10454782f18ecdfd60ca39ef413357edb0b/internal/xds/xdsdepmgr/xds_dependency_manager_test.go#L97) To mitigate this situation we had a done channel , that we [close when the test ends](https://github.com/grpc/grpc-go/blob/bccbb10454782f18ecdfd60ca39ef413357edb0b/internal/xds/xdsdepmgr/xds_dependency_manager_test.go#L1273) and the update function [check](https://github.com/grpc/grpc-go/blob/bccbb10454782f18ecdfd60ca39ef413357edb0b/internal/xds/xdsdepmgr/xds_dependency_manager_test.go#L95) for the done channel , if it is close, it returns. In [TestAggregateCLusterChildError](https://github.com/grpc/grpc-go/blob/bccbb10454782f18ecdfd60ca39ef413357edb0b/internal/xds/xdsdepmgr/xds_dependency_manager_test.go#L1170) test , (and other tests too) , the channel was being closed at the end of the test , which meant it will close only when the test passes. And if the tests fails, the done channel was not getting closed , which made the update function to block. Changed the close(done channel) to be a defer function , but this should be the first thing that happens after the test is done , before the dependency manager closes, so this is the last defer called in the test. Also fixes `TestRouteConfigResourceError` flaking due to a similar error. RELEASE NOTES: None

Addresses grpc#1824 This PR adds a test for the case where the client specifies a codec that is unsupported on the server. In a previous PR, we had added a warning for this scenario. In a subsequent PR, we will add an environment variable to gate the new behavior, where the server will not default to using the `proto` codec for unsupported codecs, but instead will fail the request. RELEASE NOTES: None

This PR remove the `grpchttp2` package since we no longer want to implement a custom http2 framer in grpc. We originally planned a custom HTTP/2 framer to get rid of internal copies. However, we have updated gRPC to control data frame I/O ([grpc#8667](grpc#8667), [grpc#8657](grpc#8657)) and will have changes that removes the bufio.Reader copy. This removes the need for a custom framer. RELEASE NOTES: None

Fixes grpc#8798 RELEASE NOTES: none

RELEASE NOTES: N/A

…rpc#8818) This adds e2e style tests to verify that when JWT call credentials (which require transport security) are used with an insecure transport, the RPC fails with a meaningful error as expected per gRFC A97. Fixes grpc#8635 RELEASE NOTES: none --------- Signed-off-by: iamrajiv <rajivperfect007@gmail.com>

If there was an error, e.g. while decompressing the data, we still want to return all buffers to the pool. RELEASE NOTES: None

Fixes: grpc#8883 This change doubles the timeout for `EDS_MultipleLocalities` as the test sends around 5000 RPCs and may time out on slow machines. RELEASE NOTES: N/A

…ues (grpc#8867) It's nicer than iterating over the keys and looking up their values. RELEASE NOTES: * resolver: Add `All()` methods to `AddressMap` and `EndpointMap` that return iterators for efficient key-value pair traversal.

While working on xDS HTTP filter state retention changes required for A83 and A86 in grpc#8745, I realized that the current implementation of HTTP filters on the server-side does not lend itself well to the changes we want to make to support filter state retention. The crux of the problem in the current implementation is as follows: - The `xdsresource` package builds a `FilterChainManager` type that does the following: - Keeps track of the list of Route Configuration resources to be requested based on the HTTP filters configured on the Listener resource - Updates the interceptors on the different filter chains based on Route Configuration resource updates (that contain filter config overrides) - Maintaining runtime state in a type that is supposed to represent a configuration blob seems to have been a wrong implementation choice - Everytime a Listener resource update is received, the `xdsresource` package builds a brand new `FilterChainManager` - This means that we *cannot* retain filter state across resource updates This PR makes the following changes: - Changes the layout of the `ListenerUpdate` struct to clearly separate client-side and server-side fields - Also groups information that is shared across client and server listener types into a logical struct - The `ListenerUpdate` will *only* contain information directly sourced from the listener proto - The xDS server implementation will build the runtime representation of the filter chains using the configuration update provided by the `xdsresource` package - This would move the responsibility of updating the interceptors on the filter chain to the xDS server, which is the correct place to be doing such a thing - This would make it possible for the xDS server implementation to retain filter state across listener resource updates RELEASE NOTES: none

This PR increases the expiry of the CAs from 1 year to 10 years and re-generates the certificates. RELEASE NOTES: N/A

…rpc#8874) This resolves an issue where below log statement keeps printing repeatedly: "ERROR: [otel-plugin] ctx passed into client side stats handler metrics event handling has no client attempt data present". When new stream is created for health/orca producers, stats and tracing is not setup. However, this fact is ignored during RPC and an error logs is printed to denote that stats cannot be handled. We will enable stream to have its own reference to the stats handler and only process per RPC implementation when it is present (like in case of regular data streams). Internal issue: b/385685802 RELEASE NOTES: * stats: only process RPC stats/tracing in health and ORCA producers if a handler is configured, preventing unnecessary error logging

mbissa · 2026-02-16T06:39:54Z

merging master (rebasing due to staleness)

arjan-bal · 2026-02-17T08:20:30Z

+	tmr.ClearMetrics()
+	dc.reportMetrics(tmr, dc.currentSize, int64(len(dc.entries)), dc.rlsServerTarget)


Since reportMetrics is just a pass-through, this unit test adds little value. Let's move it to the RLS balancer tests: trigger actions -> verify metrics. If making that deterministic is too complex, let me know.

There are existing tests in metrics_test.go that may have some coverage already.

Done, removed the test here.
The tests in metrics_test.go verify the grpc.lb.rls.cache_entries and grpc.lb.rls.cache_size metrics in the context of the full balancer lifecycle.

arjan-bal

LGTM

arjan-bal · 2026-03-11T08:13:34Z

Please merge the master branch to unblock go1.26 CI workflows that were recently added.

…work

ejona86 · 2026-03-11T20:06:59Z

It looks like the commit from this PR was busted, with all the Co-authored-by.
98573cb

Looking like merging/rebasing went wrong and someone didn't check when merging.

mbissa · 2026-03-12T09:40:29Z

It looks like the commit from this PR was busted, with all the Co-authored-by. 98573cb

Looking like merging/rebasing went wrong and someone didn't check when merging.

There were many merge conflicts due to this PR not seeing activity in last 30-45 days. I thought merge commits are supposed to have this. I will ensure that this doesn't happen next time.

moving rls cache metrics to use async gauge framework

07a7921

mbissa added this to the 1.79 Release milestone Jan 7, 2026

mbissa requested a review from arjan-bal January 7, 2026 02:27

mbissa self-assigned this Jan 7, 2026

mbissa added the Type: Feature New features or improvements in behavior label Jan 7, 2026

mbissa assigned arjan-bal Jan 7, 2026

arjan-bal reviewed Jan 7, 2026

View reviewed changes

arjan-bal removed their assignment Jan 7, 2026

arjan-bal added the Area: Observability Includes Stats, Tracing, Channelz, Healthz, Binlog, Reflection, Admin, GCP Observability label Jan 7, 2026

arjan-bal reviewed Jan 7, 2026

View reviewed changes

Comment thread balancer/rls/balancer.go Outdated

Pranjali-2501 modified the milestones: 1.79 Release, 1.80 Release Jan 21, 2026

eshitachandwani and others added 17 commits February 16, 2026 06:36

transport: Remove unused return value from connect() (grpc#8787)

7f8b169

See: * grpc#8666 (comment) RELEASE NOTES: none Signed-off-by: Tom Wieczorek <twieczorek@mirantis.com>

grpc: improve the docstring for StreamServerInterceptor (grpc#8823)

083fc97

Fixes grpc#8798 RELEASE NOTES: none

arjan-bal and others added 9 commits February 16, 2026 06:36

xds: split service and message codegen import (grpc#8881)

43bd5cb

RELEASE NOTES: N/A

mem: ensure Reader buffers are reused on early return (grpc#8886)

eac21d1

If there was an error, e.g. while decompressing the data, we still want to return all buffers to the pool. RELEASE NOTES: None

cdsbalancer: Increase timeout for test (grpc#8889)

23d3e8c

Fixes: grpc#8883 This change doubles the timeout for `EDS_MultipleLocalities` as the test sends around 5000 RPCs and may time out on slow machines. RELEASE NOTES: N/A

advancedtls: re-generate test certs and increase expiry (grpc#8900)

d193f7b

This PR increases the expiry of the CAs from 1 year to 10 years and re-generates the certificates. RELEASE NOTES: N/A

Merge branch 'master' into move-rls-cache-metrics-to-async-framework

415a65b

review comments

b688f86

mbissa force-pushed the move-rls-cache-metrics-to-async-framework branch from aa21c76 to b688f86 Compare February 16, 2026 08:57

mbissa assigned arjan-bal and unassigned mbissa Feb 16, 2026

arjan-bal reviewed Feb 17, 2026

View reviewed changes

arjan-bal assigned mbissa and unassigned arjan-bal Feb 17, 2026

mbissa modified the milestones: 1.80 Release, 1.81 Release Mar 6, 2026

review comments

85d8b2c

mbissa assigned arjan-bal and unassigned mbissa Mar 10, 2026

arjan-bal approved these changes Mar 11, 2026

View reviewed changes

arjan-bal assigned mbissa and unassigned arjan-bal Mar 11, 2026

Merge branch 'grpc:master' into move-rls-cache-metrics-to-async-frame…

c12db10

…work

mbissa merged commit 98573cb into grpc:master Mar 11, 2026
14 checks passed

		tmr.ClearMetrics()
		dc.reportMetrics(tmr, dc.currentSize, int64(len(dc.entries)), dc.rlsServerTarget)

Conversation

mbissa commented Jan 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov Bot commented Jan 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

arjan-bal left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mbissa commented Feb 16, 2026

Uh oh!

Uh oh!

arjan-bal Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

mbissa Mar 10, 2026

Choose a reason for hiding this comment

Uh oh!

arjan-bal left a comment

Choose a reason for hiding this comment

Uh oh!

arjan-bal commented Mar 11, 2026

Uh oh!

Uh oh!

ejona86 commented Mar 11, 2026

Uh oh!

mbissa commented Mar 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

17 participants

mbissa commented Jan 7, 2026 •

edited

Loading

codecov Bot commented Jan 7, 2026 •

edited

Loading