
rate_limit_quota: fix ASAN flake in GlobalRateLimitClientImpl's RLQS Stream#41053

Merged
paul-r-gall merged 4 commits into envoyproxy:main from bsurber:fix-asan-error-in-rlqs-client
Sep 20, 2025

Conversation

@bsurber
Contributor

@bsurber bsurber commented Sep 11, 2025

Commit Message: The RLQS async stream in `GlobalRateLimitClientImpl` (`stream_`) doesn't actually own the underlying raw stream ptr. This was causing a race condition during shutdown, with the cluster-manager's deferred stream reset+deletion racing against the global client's deferred deletion. If the deferred global client deletion triggered first, without resetting the stream, then the cluster-manager would fail in its own stream reset attempt (the stream's callbacks having been deleted with the global client). If the global client guarantees stream reset + deletion, and the cluster manager wins the race, then the global client's reset + deletion fails with heap-use-after-free.

To get around this race condition, the `GlobalRateLimitClientImpl` can instead own its `RawAsyncClient` & delete it to guarantee that any of its active streams are cleaned up.


Additional Description: With the owned `RawAsyncClient`, integration testing saw a new flake where sometimes the first connection to a fake upstream failed immediately with an empty-message internal error. This was addressed by adding `waitForRlqsStream()` to check all fake upstream connections for new streams, not just the first.


Risk Level:
Testing: Unit & integration. `integration_test` & `filter_persistence_test` were run 500 times to check for flakes.
Docs Changes:
Release Notes:
Platform Specific Features:
Fixes ASAN flake from PR #40497

@repokitteh-read-only

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #41053 was opened by bsurber.


…l. The active stream is ensured to be reset by owning & deleting its parent rlqs client instead.

Signed-off-by: Brian Surber <bsurber@google.com>
@bsurber bsurber force-pushed the fix-asan-error-in-rlqs-client branch from 4014663 to 15307a0 on September 11, 2025 22:34
@bsurber bsurber marked this pull request as ready for review September 12, 2025 20:38
@tyxia
Member

tyxia commented Sep 15, 2025

@paul-r-gall Since you have reviewed the PR, do you want to take another pass to see if it looks good to you?

Thanks!

ENVOY_LOG(debug, "gRPC stream closed remotely with status {}: {}", status, message);
stream_ = nullptr;
});
ASSERT_IS_MAIN_OR_TEST_THREAD();
Contributor
I did some digging and convinced myself that this will be true for both the Envoy gRPC client and the Google gRPC client.

Contributor Author

Yeah I added a few of those assertions, mostly for my own surety around the callbacks & global client construction + destruction.
Are you looking for me to remove those for consistency across functions?

Contributor

no, no action needs to be taken; these are typically compiled away in most production builds. maybe a confirmation from @tyxia that removing the main thread post is safe.

Contributor Author

@bsurber bsurber Sep 16, 2025

For onRemoteClose(...) specifically, it makes sense to happen synchronously, because the stream object is unavailable as soon as the remote close happens.

The ASSERT is there to check my understanding about the clients' threading. It appears that the client saves & operates off of the thread_id and/or dispatcher from the thread that created it (e.g. usage of isThreadSafe()). The child streams pull in the dispatcher from their parent client & post callbacks to it (google_async_client_impl example). In the global client's case, that thread is the main thread.

Member

The async client in RLQS is created on the main thread, so onRemoteClose here is called on the main thread.

@bsurber Could you add some comments? You can do it in a follow-up PR, to avoid going through the CI again.

Contributor Author

SGTM

@paul-r-gall
Contributor

A general comment that using a shared rate limit client is a performance optimization. I am slightly concerned that removing it will cause a performance degradation and am curious if you have attempted any benchmarking.

@bsurber
Contributor Author

bsurber commented Sep 15, 2025

> A general comment that using a shared rate limit client is a performance optimization. I am slightly concerned that removing it will cause a performance degradation and am curious if you have attempted any benchmarking.

In this case, the rate limiting client is a RateLimitQuotaService client, which operates entirely outside of the critical path with asynchronous flows. All sending of Usage Reports, processing of Responses, cached assignment expirations+fallbacks, etc. are handled by the main thread.
The only shared data on the filter's critical path is a thread-local shared_ptr to a cache, which the filter can read from safely. The only potential contention will come from hitting a TokenBucket shared across threads (specifically the TB's atomics) and incrementing the allowed/blocked atomics in the usage cache. And the contention around those atomics won't even increase with this PR, as max-contenders == worker thread count regardless.

}

absl::StatusOr<Grpc::RawAsyncClientPtr> rlqs_stream_client =
(*rlqs_stream_client_factory)->createUncachedRawAsyncClient();
Member

QQ:
With this change, we are going to create the async client every time, basically disabling the async client cache feature?

Contributor Author

Thanks to the persistence work in #40497, this global client object gets created once per combination of domain + rlqs target, and then continues to live so long as at least one filter factory continues to reference that same combination. The unique_ptr guarantees a single async client's creation & deletion on the same timeframe.

@tyxia
Member

tyxia commented Sep 19, 2025

/retest

@paul-r-gall paul-r-gall merged commit dd32d43 into envoyproxy:main Sep 20, 2025
24 checks passed
mbadov pushed a commit to mbadov/envoy that referenced this pull request Sep 22, 2025
lucaschimweg pushed a commit to lucaschimweg/envoy that referenced this pull request Sep 23, 2025