rls: update rls cache metrics to use async gauge framework#8808
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #8808 +/- ##
==========================================
- Coverage 83.26% 83.25% -0.01%
==========================================
Files 410 410
Lines 32576 32598 +22
==========================================
+ Hits 27123 27139 +16
- Misses 4062 4066 +4
- Partials 1391 1393 +2
🚀 New features to boost your workflow:
|
arjan-bal
left a comment
There was a problem hiding this comment.
I think we can clarify the meaning of asynchronous in the release notes by adding something like: "The metric is recorded once per collection cycle, rather than every time its value changes."
…grpc#8793) This is Part of A74 changes. This PR add a new `IsDynamic` field to the CDS balancer LB Config. Also sets it to true for RLS cluster specifier plugin. This PR also fixes the test in rls_test.go which earlier was returning if the test has `wantError=true` effectively not testing the cases after that. This will be used to dynamically start watch for cluster specifier plugin clusters from cds balancer. RELEASE NOTES: None
The change sets an initial estimate for the `mem.BufferSlice` capacity
while reading messages to avoid reallocations for messages larger than
16KB.
## Benchmarks
```sh
$ go run benchmark/benchresult/main.go streaming-before streaming-after
streaming-networkMode_Local-bufConn_true-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentC
alls_120-reqSize_16500B-respSize_16500B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteB
ufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_simple-shar
edWriteBuffer_false
Title Before After Percentage
TotalOps 5040044 5024089 -0.32%
SendOps 0 0 NaN%
RecvOps 0 0 NaN%
Bytes/op 75499.82 75446.71 -0.07%
Allocs/op 26.24 24.22 -7.62%
ReqT/op 11088096800.00 11052995800.00 -0.32%
RespT/op 11088096800.00 11052995800.00 -0.32%
50th-Lat 1.215848ms 1.22471ms 0.73%
90th-Lat 2.258081ms 2.25736ms -0.03%
99th-Lat 2.78024ms 2.756574ms -0.85%
Avg-Lat 1.427389ms 1.431989ms 0.32%
GoVersion go1.24.8 go1.24.8
GrpcVersion 1.79.0-dev 1.79.0-dev
```
RELEASE NOTES:
* transport: Reduce slice re-allocations by reserving slice capacity.
grpc#7378 added an internal API to propagate the locality ID of each address used by a subchannel. This was required because the legacy pick_first implementation created subchannels with multiple addresses, each with a potentially different locality. Since the legacy implementation has been removed, and the new pick_first creates subchannels with only a single address, this workaround is now redundant. This PR also make the `UpdateAddresses` method of `clusterImpl` balancer a no-op since none of the leaf policies use it anymore. RELEASE NOTES: N/A
This change migrates usages of ResolverState.Addresses to ResolverState.Endpoints. Prior to Dualstack support, xDS LB policies relied exclusively on the `Addresses` field. As part of the Dualstack initiative, these policies were updated to support both `Endpoints` and `Addresses`. This PR is the final step to deprecate the `Addresses` propagation logic in favor of relying solely on `Endpoints`. RELEASE NOTES: N/A
When a connection attempt is canceled after the transport is created, ensure that it is closed inline. This prevents untracked goroutines from being left behind. See: * grpc#8655 * grpc#8666 (comment) RELEASE NOTES: * When canceling a connection attempt, closing the transport won't use an untracked goroutine anymore. Signed-off-by: Tom Wieczorek <twieczorek@mirantis.com>
… CONNECTING to CONNECTING (grpc#8813) Fixes grpc#8516 This PR ensures that the init timer is not restarted when a child policy transitions from CONNECTING to CONNECTING. This ensures that we will only ever wait for `DefaultPriorityInitTimeout` which is set to a value of `10s` for a given child to become `Ready` before we attempt a failover to the next highest priority child. RELEASE NOTES: - xds/priority: Fixed a bug causing delayed failover to lower-priority clusters when a higher-priority cluster is stuck in the CONNECTING state.
See: * grpc#8666 (comment) RELEASE NOTES: none Signed-off-by: Tom Wieczorek <twieczorek@mirantis.com>
stream: initialize decompressor for health check streams Fixes grpc#8162 Health checks fail when using legacy compression options (WithDecompressor on client or RPCDecompressor on server). The health check stream created via newNonRetryClientStream does not initialize the decompressorV0 field with the legacy decompressor from dial options, causing an Internal error when attempting to decompress server responses. This fix initializes the decompressorV0 field in addrConnStream with ac.cc.dopts.dc, matching the behavior of the regular clientStream path (line 492) to ensure health check streams can properly decompress responses when legacy compression options are configured. RELEASE NOTES: * health: Fixed health checks failing when using legacy compression options WithDecompressor or RPCDecompressor.
…d correct resources (grpc#8794) This PR fixes the `TestErrorFromParentLB_ResourceNotFound` to reconfigure the same CDS and EDS resources after they have been removed instead of configuring all the new resources. Also removes unused `defaultTestShortTimeout` from [internal/xds/balancer/clustermanager/clustermanager_test.go](https://github.com/grpc/grpc-go/blob/900ffa948c031ddd50a667750f7cba8f01c66c4d/internal/xds/balancer/clustermanager/clustermanager_test.go#L51) RELEASE NOTES: None
Use new-style atomic APIs instead of the old ones in the WRR scheduler type. The changes made in this PR improve the code in the following ways: - Ergonomics: Method-based API vs function-based, no pointer management needed - Safety: Type safety prevents mixing atomic/non-atomic operations, eliminates pointer errors - Clarity: The atomic.Uint32 type makes atomic intent explicit from declaration RELEASE NOTES: none
[Splitting a `buffer`](https://github.com/grpc/grpc-go/blob/40466769682557e7179b8c74ba3820cc78d49b4b/mem/buffers.go#L172-L187) results in fetching a new `buffer` object from a `sync.Pool`. The `buffer` object is returned back to the pool only once [the shared ref count falls to 0](https://github.com/grpc/grpc-go/blob/40466769682557e7179b8c74ba3820cc78d49b4b/mem/buffers.go#L152-L155). As a result, only one of the `buffer` objects is returned back to the pool for re-use. The "leaked" buffer objects may cause noticeable allocations when buffers are split more frequently. I noticed this when [attempting to remove a buffer copy](https://github.com/grpc/grpc-go/compare/master...arjan-bal:zero-copy-buf-reader?expand=1) by replacing the bufio.Reader. ## Solution This PR introduces a root-owner model for the underlying `*[]byte` within `buffer` objects. The root object manages the slice's lifecycle, returning it to the pool only when its reference count reaches zero. When a `buffer` is split, the new `buffer` is treated as a child, incrementing the ref counts for both itself and the root. Once a child’s ref count hits zero, it returns itself to the pool and decrements the root’s count. Additionally, this PR removes the `sync.Pool` used for `*atomic.Int32` by embedding `atomic.Int32` as a value field within the `buffer` struct. By eliminating the second pool and the associated pointer indirection, we reduce allocation overhead and improve cache locality during buffer lifecycle events. ## Benchmarks A micro-benchmark showing the buffer object leak: ```go func BenchmarkSplit(b *testing.B) { pool := mem.DefaultBufferPool() b.Run("split", func(b *testing.B) { for b.Loop() { size := 1 << 15 // 32 KB slice := pool.Get(size) buf := mem.NewBuffer(slice, pool) left, right := mem.SplitUnsafe(buf, size/2) left.Free() right.Free() } }) b.Run("no-split", func(b *testing.B) { for b.Loop() { size := 1 << 15 // 32 KB slice := pool.Get(size) buf := mem.NewBuffer(slice, pool) buf.Free() } }) } ``` Result on master vs this PR. ```sh goos: linux goarch: amd64 pkg: google.golang.org/grpc/mem cpu: Intel(R) Xeon(R) CPU @ 2.60GHz │ old.txt │ new.txt │ │ sec/op │ sec/op vs base │ Split/split-48 418.2n ± 0% 263.9n ± 1% -36.89% (p=0.000 n=10) Split/no-split-48 221.1n ± 1% 208.5n ± 0% -5.70% (p=0.000 n=10) geomean 304.1n 234.6n -22.86% │ old.txt │ new.txt │ │ B/op │ B/op vs base │ Split/split-48 64.00 ± 0% 0.00 ± 0% -100.00% (p=0.000 n=10) Split/no-split-48 0.000 ± 0% 0.000 ± 0% ~ (p=1.000 n=10) ¹ geomean ² ? ² ³ ¹ all samples are equal ² summaries must be >0 to compute geomean ³ ratios must be >0 to compute geomean │ old.txt │ new.txt │ │ allocs/op │ allocs/op vs base │ Split/split-48 1.000 ± 0% 0.000 ± 0% -100.00% (p=0.000 n=10) Split/no-split-48 0.000 ± 0% 0.000 ± 0% ~ (p=1.000 n=10) ¹ geomean ² ? ² ³ ¹ all samples are equal ² summaries must be >0 to compute geomean ³ ratios must be >0 to compute geomean ``` The effect on local gRPC benchmarks is negligible since the `SplitUnsafe` function isn't called very frequently. ```sh $ go run benchmark/benchresult/main.go unary-before unary-after unary-networkMode_Local-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurr entCalls_120-reqSize_16000B-respSize_16000B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-c lientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBuff erPool_simple-sharedWriteBuffer_false Title Before After Percentage TotalOps 2985694 3024364 1.30% SendOps 0 0 NaN% RecvOps 0 0 NaN% Bytes/op 74784.94 74784.99 0.00% Allocs/op 133.67 133.89 0.00% ReqT/op 6369480533.33 6451976533.33 1.30% RespT/op 6369480533.33 6451976533.33 1.30% 50th-Lat 2.410033ms 2.40116ms -0.37% 90th-Lat 3.145118ms 3.081771ms -2.01% 99th-Lat 3.563055ms 3.629663ms 1.87% Avg-Lat 2.410529ms 2.379513ms -1.29% GoVersion go1.24.8 go1.24.8 GrpcVersion 1.78.0-dev 1.78.0-dev ``` RELEASE NOTES: * mem: Improve pooling of `buffer` objects on using `SplitUnsafe`.
Following the dual-stack changes that established `pick_first` as the universal leaf policy and the subsequent removal of the legacy `pick_first` implementation, `ClientConn.UpdateAddresses` is no longer utilized within the xDS LB policy tree. When addresses change, old `SubConns` are now shut down and new ones are created. This PR makes `UpdateAddresses` a no-op and removes the corresponding test. Additionally, it ensures the `subConnWrapper` is removed from the ejection/unejection map when a SubConn closes to prevent memory leaks. RELEASE NOTES: N/A
…ilure (grpc#8805) Fixes: grpc#8801 There were 2 problems here : 1. The actual DNS lookup calls return different resolved addresses for "localhost" on different machines. Changed the code to replace the DNS resolver with a manual resolver in test to mock the DNS resolver. 2. In the case where the tests send a EDS or cluster error , the xds management server keeps sending resource error continuously which in turn triggers updates from dependency manager which pushes the [update to a channel.](https://github.com/grpc/grpc-go/blob/bccbb10454782f18ecdfd60ca39ef413357edb0b/internal/xds/xdsdepmgr/xds_dependency_manager_test.go#L97) To mitigate this situation we had a done channel , that we [close when the test ends](https://github.com/grpc/grpc-go/blob/bccbb10454782f18ecdfd60ca39ef413357edb0b/internal/xds/xdsdepmgr/xds_dependency_manager_test.go#L1273) and the update function [check](https://github.com/grpc/grpc-go/blob/bccbb10454782f18ecdfd60ca39ef413357edb0b/internal/xds/xdsdepmgr/xds_dependency_manager_test.go#L95) for the done channel , if it is close, it returns. In [TestAggregateCLusterChildError](https://github.com/grpc/grpc-go/blob/bccbb10454782f18ecdfd60ca39ef413357edb0b/internal/xds/xdsdepmgr/xds_dependency_manager_test.go#L1170) test , (and other tests too) , the channel was being closed at the end of the test , which meant it will close only when the test passes. And if the tests fails, the done channel was not getting closed , which made the update function to block. Changed the close(done channel) to be a defer function , but this should be the first thing that happens after the test is done , before the dependency manager closes, so this is the last defer called in the test. Also fixes `TestRouteConfigResourceError` flaking due to a similar error. RELEASE NOTES: None
Addresses grpc#1824 This PR adds a test for the case where the client specifies a codec that is unsupported on the server. In a previous PR, we had added a warning for this scenario. In a subsequent PR, we will add an environment variable to gate the new behavior, where the server will not default to using the `proto` codec for unsupported codecs, but instead will fail the request. RELEASE NOTES: None
This PR remove the `grpchttp2` package since we no longer want to implement a custom http2 framer in grpc. We originally planned a custom HTTP/2 framer to get rid of internal copies. However, we have updated gRPC to control data frame I/O ([grpc#8667](grpc#8667), [grpc#8657](grpc#8657)) and will have changes that removes the bufio.Reader copy. This removes the need for a custom framer. RELEASE NOTES: None
Fixes grpc#8798 RELEASE NOTES: none
RELEASE NOTES: N/A
…rpc#8818) This adds e2e style tests to verify that when JWT call credentials (which require transport security) are used with an insecure transport, the RPC fails with a meaningful error as expected per gRFC A97. Fixes grpc#8635 RELEASE NOTES: none --------- Signed-off-by: iamrajiv <rajivperfect007@gmail.com>
If there was an error, e.g. while decompressing the data, we still want to return all buffers to the pool. RELEASE NOTES: None
Fixes: grpc#8883 This change doubles the timeout for `EDS_MultipleLocalities` as the test sends around 5000 RPCs and may time out on slow machines. RELEASE NOTES: N/A
…ues (grpc#8867) It's nicer than iterating over the keys and looking up their values. RELEASE NOTES: * resolver: Add `All()` methods to `AddressMap` and `EndpointMap` that return iterators for efficient key-value pair traversal.
While working on xDS HTTP filter state retention changes required for A83 and A86 in grpc#8745, I realized that the current implementation of HTTP filters on the server-side does not lend itself well to the changes we want to make to support filter state retention. The crux of the problem in the current implementation is as follows: - The `xdsresource` package builds a `FilterChainManager` type that does the following: - Keeps track of the list of Route Configuration resources to be requested based on the HTTP filters configured on the Listener resource - Updates the interceptors on the different filter chains based on Route Configuration resource updates (that contain filter config overrides) - Maintaining runtime state in a type that is supposed to represent a configuration blob seems to have been a wrong implementation choice - Everytime a Listener resource update is received, the `xdsresource` package builds a brand new `FilterChainManager` - This means that we *cannot* retain filter state across resource updates This PR makes the following changes: - Changes the layout of the `ListenerUpdate` struct to clearly separate client-side and server-side fields - Also groups information that is shared across client and server listener types into a logical struct - The `ListenerUpdate` will *only* contain information directly sourced from the listener proto - The xDS server implementation will build the runtime representation of the filter chains using the configuration update provided by the `xdsresource` package - This would move the responsibility of updating the interceptors on the filter chain to the xDS server, which is the correct place to be doing such a thing - This would make it possible for the xDS server implementation to retain filter state across listener resource updates RELEASE NOTES: none
This PR increases the expiry of the CAs from 1 year to 10 years and re-generates the certificates. RELEASE NOTES: N/A
…rpc#8874) This resolves an issue where below log statement keeps printing repeatedly: "ERROR: [otel-plugin] ctx passed into client side stats handler metrics event handling has no client attempt data present". When new stream is created for health/orca producers, stats and tracing is not setup. However, this fact is ignored during RPC and an error logs is printed to denote that stats cannot be handled. We will enable stream to have its own reference to the stats handler and only process per RPC implementation when it is present (like in case of regular data streams). Internal issue: b/385685802 RELEASE NOTES: * stats: only process RPC stats/tracing in health and ORCA producers if a handler is configured, preventing unnecessary error logging
|
merging master (rebasing due to staleness) |
aa21c76 to
b688f86
Compare
| tmr.ClearMetrics() | ||
| dc.reportMetrics(tmr, dc.currentSize, int64(len(dc.entries)), dc.rlsServerTarget) |
There was a problem hiding this comment.
Since reportMetrics is just a pass-through, this unit test adds little value. Let's move it to the RLS balancer tests: trigger actions -> verify metrics. If making that deterministic is too complex, let me know.
There are existing tests in metrics_test.go that may have some coverage already.
There was a problem hiding this comment.
Done, removed the test here.
The tests in metrics_test.go verify the grpc.lb.rls.cache_entries and grpc.lb.rls.cache_size metrics in the context of the full balancer lifecycle.
|
Please merge the master branch to unblock go1.26 CI workflows that were recently added. |
|
It looks like the commit from this PR was busted, with all the Co-authored-by. Looking like merging/rebasing went wrong and someone didn't check when merging. |
There were many merge conflicts due to this PR not seeing activity in last 30-45 days. I thought merge commits are supposed to have this. I will ensure that this doesn't happen next time. |
This PR changes the way rls cache metrics ("grpc.lb.rls.cache_entries" & "grpc.lb.rls.cache_size") to use async gauge framework instead of synchronoush capturing.
This also does cleanup to remove the NoopMetricsRecorder struct along with addressing some old nits.
RELEASE NOTES: