rate_limit_quota: fix ASAN flake in GlobalRateLimitClientImpl's RLQS Stream (#41053)
Conversation
…l. The active stream is instead guaranteed to be reset by owning and deleting its parent RLQS client. Signed-off-by: Brian Surber <bsurber@google.com>
Force-pushed from 4014663 to 15307a0.
Signed-off-by: Brian Surber <bsurber@google.com>
source/extensions/filters/http/rate_limit_quota/global_client_impl.h
@paul-r-gall Since you have reviewed the PR, do you want to take another pass to see if it looks good to you? Thanks!
        ENVOY_LOG(debug, "gRPC stream closed remotely with status {}: {}", status, message);
        stream_ = nullptr;
      });
      ASSERT_IS_MAIN_OR_TEST_THREAD();
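The diff above clears the client's stream handle as soon as the remote close fires. A minimal, self-contained sketch of that pattern (hypothetical `SketchClient`/`SketchStream` types for illustration, not Envoy's real API) looks like:

```cpp
#include <memory>
#include <string>

// Hypothetical types for illustration only; Envoy's real stream and client
// classes are more involved.
struct SketchStream {};

class SketchClient {
public:
  void onStreamOpened() { stream_ = std::make_shared<SketchStream>(); }

  // Mirrors the onRemoteClose() idea from the diff: the stream object is
  // unusable once the remote close happens, so the handle is dropped
  // immediately and synchronously.
  void onRemoteClose(int /*status*/, const std::string& /*message*/) {
    stream_ = nullptr;
  }

  bool hasActiveStream() const { return stream_ != nullptr; }

private:
  std::shared_ptr<SketchStream> stream_;
};
```

Any later send path can then check `hasActiveStream()` instead of touching a dangling pointer.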
I did some digging and convinced myself that this will be true for both the Envoy gRPC client and the Google gRPC client.
Yeah, I added a few of those assertions, mostly for my own assurance around the callbacks and the global client's construction and destruction.
Are you looking for me to remove those for consistency across functions?
No, no action needs to be taken; these are typically compiled away in most production builds. Maybe get a confirmation from @tyxia that removing the main-thread post is safe.
For onRemoteClose(...) specifically, it makes sense for it to happen synchronously, because the stream object is unavailable as soon as the remote close happens.
The ASSERT is there to check my understanding of the clients' threading. It appears that a client saves and operates off of the thread_id and/or dispatcher of the thread that created it (e.g. the usage of isThreadSafe()). Child streams pull in the dispatcher from their parent client and post callbacks to it (see google_async_client_impl for an example). In the global client's case, that thread is the main thread.
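The thread-affinity behavior described here can be sketched with a toy dispatcher (a hypothetical `MiniDispatcher`, not Envoy's `Event::Dispatcher`) that records its creating thread, similar to how `isThreadSafe()` compares the caller against the owning thread:

```cpp
#include <thread>
#include <utility>

// Hypothetical toy dispatcher: remembers the thread that created it, and
// isThreadSafe() reports whether the caller is on that same thread.
class MiniDispatcher {
public:
  MiniDispatcher() : owner_(std::this_thread::get_id()) {}
  bool isThreadSafe() const { return std::this_thread::get_id() == owner_; }

private:
  std::thread::id owner_;
};

// Returns {called_on_owner_thread, called_on_other_thread} for illustration.
std::pair<bool, bool> checkAffinity() {
  MiniDispatcher dispatcher;  // created on the "main" thread here
  const bool on_owner = dispatcher.isThreadSafe();
  bool on_other = true;
  std::thread worker([&] { on_other = dispatcher.isThreadSafe(); });
  worker.join();
  return {on_owner, on_other};
}
```

A cross-thread caller would fail such a check, which is exactly what the ASSERT guards against.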
The async client in RLQS is created on the main thread, so onRemoteClose here is called on the main thread.
@bsurber Could you add some comments? You can do it in a follow-up PR to avoid going through CI again.
A general comment: using a shared rate limit client is a performance optimization. I am slightly concerned that removing it will cause a performance degradation, and I am curious whether you have attempted any benchmarking.
In this case, the rate limiting client is a RateLimitQuotaService client, which operates entirely outside of the critical path with asynchronous flows. All sending of Usage Reports, processing of Responses, cached assignment expirations and fallbacks, etc. are handled by the main thread.
    absl::StatusOr<Grpc::RawAsyncClientPtr> rlqs_stream_client =
        (*rlqs_stream_client_factory)->createUncachedRawAsyncClient();
QQ: with this change, we are going to create the async client every time, basically disabling the async client cache feature?
Thanks to the persistence work in #40497, this global client object gets created once per combination of domain + RLQS target, and then continues to live as long as at least one filter factory continues to reference that same combination. The unique_ptr guarantees a single async client's creation and deletion on the same timeframe.
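The ownership guarantee can be illustrated with a reduced sketch (hypothetical `Fake*` types, not Envoy's real classes): because the global client uniquely owns the raw async client, destroying the global client deterministically tears down any active stream with it.

```cpp
#include <memory>

// Hypothetical stand-ins for the real RLQS types, showing the lifetime chain:
// global client -> owned raw async client -> active stream.
struct FakeStream {
  explicit FakeStream(bool* flag) : reset_flag(flag) {}
  ~FakeStream() { *reset_flag = true; }  // "stream reset" on destruction
  bool* reset_flag;
};

struct FakeRawAsyncClient {
  std::unique_ptr<FakeStream> active_stream;  // torn down with the client
};

class GlobalClientSketch {
public:
  explicit GlobalClientSketch(bool* stream_reset)
      : raw_client_(std::make_unique<FakeRawAsyncClient>()) {
    raw_client_->active_stream = std::make_unique<FakeStream>(stream_reset);
  }

private:
  // unique_ptr ownership: when the global client is destroyed, the raw
  // client (and therefore its stream) is destroyed on the same timeframe.
  std::unique_ptr<FakeRawAsyncClient> raw_client_;
};

// True iff destroying the global client also reset its stream.
bool streamResetWhenClientDestroyed() {
  bool reset = false;
  { GlobalClientSketch client(&reset); }  // client goes out of scope here
  return reset;
}
```

With this shape there is no second owner racing to reset the stream, which is the race the PR removes.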
/retest
…Stream (envoyproxy#41053)

Commit Message: The RLQS async stream in `GlobalRateLimitClientImpl` (`stream_`) doesn't actually own the underlying raw stream ptr. This was causing a race condition during shutdown, with the cluster-manager's deferred stream reset + deletion racing against the global client's deferred deletion. If the deferred global client deletion triggered first, without resetting the stream, then the cluster-manager would fail in its own stream reset attempt (the stream's callbacks having been deleted with the global client). If the global client guarantees stream reset + deletion, and the cluster manager wins the race, then the global client's reset + deletion fails with heap-use-after-free. To get around this race condition, the `GlobalRateLimitClientImpl` can instead own its `RawAsyncClient` and delete it to guarantee that any of its active streams are cleaned up.

Additional Description: With the owned `RawAsyncClient`, integration testing saw a new flake where sometimes the first connection to a fake upstream failed immediately with an empty-message internal error. This was addressed by adding `waitForRlqsStream()` to check all fake upstream connections for new streams, not just the first.

Risk Level:
Testing: Unit & integration. integration_test & filter_persistence_test run 500 times to check for flakes.
Docs Changes:
Release Notes:
Platform Specific Features:
Fixes ASAN flake from PR envoyproxy#40497

Signed-off-by: Brian Surber <bsurber@google.com>
Signed-off-by: Misha Badov <mbadov@google.com>