
rate_limit_quota: fix ASAN flake in GlobalRateLimitClientImpl's RLQS Stream#41053

Merged
paul-r-gall merged 4 commits into envoyproxy:main from bsurber:fix-asan-error-in-rlqs-client
Sep 20, 2025

Conversation

@bsurber
Contributor

@bsurber bsurber commented Sep 11, 2025

Commit Message: The RLQS async stream in `GlobalRateLimitClientImpl` (`stream_`) doesn't actually own the underlying raw stream ptr. This was causing a race condition during shutdown, with the cluster-manager's deferred stream reset+deletion racing against the global client's deferred deletion. If the deferred global client deletion triggered first, without resetting the stream, then the cluster-manager would fail in its own stream reset attempt (the stream's callbacks having been deleted with the global client). If the global client guarantees stream reset + deletion, and the cluster manager wins the race, then the global client's reset + deletion fails with heap-use-after-free.

To get around this race condition, the `GlobalRateLimitClientImpl` can instead own its `RawAsyncClient` & delete it to guarantee that any of its active streams are cleaned up.


Additional Description: With the owned `RawAsyncClient`, integration testing saw a new flake where sometimes the first connection to a fake upstream failed immediately with an empty-message internal error. This was addressed by adding `waitForRlqsStream()` to check all fake upstream connections for new streams, not just the first.


Risk Level:
Testing: Unit & integration. `integration_test` & `filter_persistence_test` were run 500 times to check for flakes.
Docs Changes:
Release Notes:
Platform Specific Features:
Fixes ASAN flake from PR #40497

@repokitteh-read-only

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #41053 was opened by bsurber.


…l. The active stream is ensured to be reset by owning & deleting its parent rlqs client instead.

Signed-off-by: Brian Surber <bsurber@google.com>
@bsurber bsurber force-pushed the fix-asan-error-in-rlqs-client branch from 4014663 to 15307a0 on September 11, 2025 22:34
@bsurber bsurber marked this pull request as ready for review September 12, 2025 20:38
@tyxia
Member

tyxia commented Sep 15, 2025

@paul-r-gall Since you have reviewed the PR, do you want to take another pass to see if it looks good to you?

Thanks!

ENVOY_LOG(debug, "gRPC stream closed remotely with status {}: {}", status, message);
stream_ = nullptr;
});
ASSERT_IS_MAIN_OR_TEST_THREAD();
Contributor
I did some digging and convinced myself that this will be true for both the Envoy gRPC client and the Google gRPC client.

Contributor Author

Yeah I added a few of those assertions, mostly for my own surety around the callbacks & global client construction + destruction.
Are you looking for me to remove those for consistency across functions?

Contributor

no, no action needs to be taken; these are typically compiled away in most production builds. maybe a confirmation from @tyxia that removing the main thread post is safe.

Contributor Author

@bsurber bsurber Sep 16, 2025

For onRemoteClose(...) specifically, it makes sense to happen synchronously, because the stream object is unavailable as soon as the remote close happens.

The ASSERT is there to check my understanding about the clients' threading. It appears that the client saves & operates off of the thread_id and/or dispatcher from the thread that created it (e.g. usage of isThreadSafe()). The child streams pull in the dispatcher from their parent client & post callbacks to it (google_async_client_impl example). In the global client's case, that thread is the main thread.

Member

The async client in RLQS is created on the main thread, so onRemoteClose here is called on the main thread.

@bsurber Could you add some comments? You can do it in a follow-up PR, to avoid going through the CI again.

Contributor Author

SGTM

@paul-r-gall
Contributor

A general comment that using a shared rate limit client is a performance optimization. I am slightly concerned that removing it will cause a performance degradation and am curious if you have attempted any benchmarking.

@bsurber
Contributor Author

bsurber commented Sep 15, 2025

> A general comment that using a shared rate limit client is a performance optimization. I am slightly concerned that removing it will cause a performance degradation and am curious if you have attempted any benchmarking.

In this case, the rate limiting client is a RateLimitQuotaService client, which operates entirely outside of the critical path with asynchronous flows. All sending of Usage Reports, processing of Responses, cached assignment expirations+fallbacks, etc. are handled by the main thread.
The only shared data on the filter's critical path is a thread-local shared_ptr to a cache, which the filter can read from safely. The only potential contention will come from hitting a TokenBucket shared across threads (specifically the TB's atomics) and incrementing the allowed/blocked atomics in the usage cache. And the contention around those atomics won't even increase with this PR, as max-contenders == worker thread count regardless.

}

absl::StatusOr<Grpc::RawAsyncClientPtr> rlqs_stream_client =
(*rlqs_stream_client_factory)->createUncachedRawAsyncClient();
Member

QQ:
With this change, we are going to create the async client every time, basically disabling the async client cache feature?

Contributor Author

Thanks to the persistence work in #40497, this global client object gets created once per combination of domain + rlqs target, and then continues to live so long as at least one filter factory continues to reference that same combination. The unique_ptr guarantees a single async client's creation & deletion on the same timeframe.

@tyxia
Member

tyxia commented Sep 19, 2025

/retest

@paul-r-gall paul-r-gall merged commit dd32d43 into envoyproxy:main Sep 20, 2025
24 checks passed
mbadov pushed a commit to mbadov/envoy that referenced this pull request Sep 22, 2025
lucaschimweg pushed a commit to lucaschimweg/envoy that referenced this pull request Sep 23, 2025