redis: Add zone-aware routing support for Redis Cluster proxy#43012
redis: Add zone-aware routing support for Redis Cluster proxy#43012bellatoris wants to merge 16 commits intoenvoyproxy:mainfrom
Conversation
|
Hi @bellatoris, welcome and thank you for your contribution. We will try to review your Pull Request as quickly as possible. In the meantime, please take a look at the contribution guidelines if you have not done so already. |
|
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to |
84f9f19 to
b7b0088
Compare
This change implements zone-aware routing for Redis Cluster, allowing read requests to be routed to replicas in the same availability zone as the client. Key changes: - Add enable_zone_discovery config option to redis_cluster.proto - Add az_affinity and az_affinity_replicas_and_primary read policies - Implement INFO command-based zone discovery during cluster slot updates - Store zone info in host locality for standard Envoy locality handling - RedisShard groups replicas by zone for efficient zone-aware routing Zone Discovery Flow: 1. CLUSTER SLOTS response triggers zone discovery when enabled 2. INFO command sent to each unique node to get availability_zone 3. Zones stored in host->locality().zone() when hosts are created 4. RedisShard reads zone from host locality, groups replicas by zone Read Policies: - AzAffinity: local replicas -> any replica -> primary - AzAffinityReplicasAndPrimary: local replicas -> local primary -> any replica -> primary Note: This feature currently works with Valkey only. Valkey exposes availability_zone in its INFO response. Standard Redis does not support this field. Signed-off-by: Doogie Min <doogie.min@sendbird.com>
b7b0088 to
085d27f
Compare
nezdolik
left a comment
There was a problem hiding this comment.
Here is partial review, will need to do few more rounds
| RedisCluster::RedisHost::create(Upstream::ClusterInfoConstSharedPtr cluster, | ||
| const std::string& hostname, | ||
| Network::Address::InstanceConstSharedPtr address, | ||
| RedisCluster& parent, bool primary, const std::string& zone) { |
There was a problem hiding this comment.
please reuse existing create method and add zone as absl::optional<string>
| std::shared_ptr<const envoy::config::core::v3::Locality> | ||
| RedisCluster::RedisHost::makeLocalityWithZone( | ||
| const envoy::config::core::v3::Locality& base_locality, const std::string& zone) { | ||
| auto locality = std::make_shared<envoy::config::core::v3::Locality>(base_locality); |
There was a problem hiding this comment.
should locality be created if zone is empty? if not then you can return absl::StatusOr<std::shared_ptr<const envoy::config::core::v3::Locality>>
|
|
||
| // Collect unique node addresses | ||
| absl::flat_hash_set<std::string> unique_addresses; | ||
| absl::flat_hash_map<std::string, bool> address_is_primary; // track if address is primary |
There was a problem hiding this comment.
would not it be sufficient just to have this map and drop unique_addresses set? since address_is_primary map will also only contain unique addresses (as keys)
| static_cast<int>(value->type())); | ||
| } | ||
|
|
||
| uint32_t remaining = pending_zone_requests_.fetch_sub(1) - 1; |
There was a problem hiding this comment.
why do we need to both atomically decrement pending_zone_requests_ and do - 1 ?
| // availability_zone:us-east-1a | ||
| // ... | ||
| // Note: Standard Redis does not expose availability_zone. | ||
| std::string |
|
And looks like tests need to be added/improved, as coverage check is failing: |
api/envoy/extensions/filters/network/redis_proxy/v3/redis_proxy.proto
Outdated
Show resolved
Hide resolved
Rename the read policy enum values for clarity: - AZ_AFFINITY -> LOCAL_ZONE_AFFINITY - AZ_AFFINITY_REPLICAS_AND_PRIMARY -> LOCAL_ZONE_AFFINITY_REPLICAS_AND_PRIMARY Also includes minor code cleanup: - Remove redundant unique_addresses set (use address_is_primary map keys instead) - Use try_emplace instead of find+insert pattern - Use structured bindings for map iteration - Add clarifying comments for fetch_sub return value semantics - Use static constexpr for string literal constant Signed-off-by: Doogie Min <doogie.min@sendbird.com>
|
/lgtm api |
fd233b4 to
2b3aa82
Compare
Add tests to improve code coverage for the Redis cluster zone discovery feature and new read policies: - Add parseAvailabilityZone tests for various INFO response formats - Add ZoneDiscoveryConfig test fixture for zone discovery testing - Add tests for LOCAL_ZONE_AFFINITY and LOCAL_ZONE_AFFINITY_REPLICAS_AND_PRIMARY read policies in client_impl_test - Add friend declaration to RedisDiscoverySession for test access These tests cover previously untested code paths in redis_cluster.cc and client_impl.cc to help meet coverage thresholds. Signed-off-by: Doogie Min <doogie.min@sendbird.com>
2db9fa9 to
186289b
Compare
|
/wait |
…alidation Add tests to improve coverage for source/extensions/clusters/redis/ from 96.1% to ~97.2% (threshold: 96.6%): - LoadBalancerStubMethods: exercises peekAnotherHost, selectExistingConnection, and lifetimeCallbacks stub methods on RedisClusterLoadBalancer (+9 lines) - RedisErrorResponsePortOverflow: validates that CLUSTER SLOTS responses with port > 0xffff are rejected by validateCluster (+2 lines) Signed-off-by: Doogie Min <doogie.min@sendbird.com>
|
@nezdolik done |
…ware-routing # Conflicts: # changelogs/current.yaml
nezdolik
left a comment
There was a problem hiding this comment.
overall lgtm, few nits while we wait for api review
| absl::Status creation_status = absl::OkStatus(); | ||
| auto ret = std::unique_ptr<RedisCluster::RedisHost>( | ||
| new RedisCluster::RedisHost(cluster, hostname, address, parent, primary, creation_status)); | ||
| auto ret = std::unique_ptr<RedisCluster::RedisHost>(new RedisCluster::RedisHost( |
There was a problem hiding this comment.
while we are waiting for api review, could we switch std::unique_ptr<RedisCluster::RedisHost>(new RedisCluster::RedisHost( to std::make_unique<RedisCluster::RedisHost>?
| const std::string& primaryZone() const { return primary_zone_; } | ||
|
|
||
| // Get replicas in a specific zone. Returns nullptr if no replicas in that zone. | ||
| const Upstream::HostSetImpl* replicasInZone(const std::string& zone) const { |
There was a problem hiding this comment.
nit: return unique_ptr here instead of raw pointer
- Switch to std::make_unique in RedisHost::create() by making constructor public - Return const unique_ptr& instead of raw pointer from replicasInZone() Signed-off-by: Doogie Min <doogie.min@sendbird.com>
- Switch to std::make_unique in RedisHost::create() by making constructor public - Return const unique_ptr& instead of raw pointer from replicasInZone() Signed-off-by: Doogie Min <doogie.min@sendbird.com>
5cd4e9b to
13f2bef
Compare
Done, could you review again @nezdolik @markdroth ? |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces zone-aware routing for the Redis Cluster proxy, a valuable feature for reducing cross-zone traffic. The changes are well-structured, touching the API, core discovery and load balancing logic, and connection pooling. The implementation includes a new zone discovery mechanism using the INFO command, which is cleanly integrated into the existing cluster discovery session. The load balancing logic is extended to support two new read policies with clear fallback semantics. The addition of comprehensive unit and integration tests covering both the discovery and load balancing aspects is commendable. I found one high-severity memory safety issue that should be addressed.
| void onZoneResponse(const std::string& address, bool is_primary, | ||
| NetworkFilters::Common::Redis::RespValuePtr&& value); | ||
| void onZoneDiscoveryFailure(const std::string& address, bool is_primary); |
There was a problem hiding this comment.
There is a potential use-after-free bug in the implementation of onZoneResponse and onZoneDiscoveryFailure. These methods are called with a const std::string& that refers to a member of a ZoneDiscoveryCallback object. Inside these methods, zone_callbacks_.erase(address) is called, which destroys the callback object and invalidates the address reference. Using this dangling reference as the key for the erase operation is undefined behavior.
A comment in onZoneResponse indicates awareness of this lifetime issue, but the implementation is still susceptible to this bug.
To fix this, you should pass the address by value to ensure the string's lifetime is extended until the function returns. The call sites in ZoneDiscoveryCallback will then create a copy, which is the desired behavior.
| void onZoneResponse(const std::string& address, bool is_primary, | |
| NetworkFilters::Common::Redis::RespValuePtr&& value); | |
| void onZoneDiscoveryFailure(const std::string& address, bool is_primary); | |
| void onZoneResponse(std::string address, bool is_primary, | |
| NetworkFilters::Common::Redis::RespValuePtr&&& value); | |
| void onZoneDiscoveryFailure(std::string address, bool is_primary); |
|
@markdroth ptal. |
…ing hazard onZoneResponse and onZoneDiscoveryFailure receive address as const ref aliasing ZoneDiscoveryCallback::address_. When zone_callbacks_.erase(address) destroys the callback, the reference becomes dangling during the erase call. Pass by value to ensure the string outlives the erase. Signed-off-by: Doogie Min <doogie.min@sendbird.com>
Signed-off-by: Doogie Min <doogie.min@sendbird.com>
448cfc5 to
94f0de7
Compare
Signed-off-by: Doogie Min <doogie.min@sendbird.com>
|
/retest |
|
@wbpcode could you do an api shepherd review? |
Summary
Zone-aware routing reduces cross-zone network traffic and latency by preferring replicas in the same availability zone as the client. This is particularly valuable in cloud environments where cross-zone data transfer incurs additional costs.
New Configuration
redis_cluster.proto:
redis_proxy.proto - New ReadPolicy values:
How It Works
Limitations
Risk Level: Low
This is an opt-in feature that requires explicit configuration (enable_zone_discovery: true and read_policy: AZ_AFFINITY). Existing deployments are unaffected.
Testing
Related Issues
Closes #43011