Skip to content

redis: Add zone-aware routing support for Redis Cluster proxy#43012

Open
bellatoris wants to merge 16 commits intoenvoyproxy:mainfrom
bellatoris:doogie/redis-zone-aware-routing
Open

redis: Add zone-aware routing support for Redis Cluster proxy#43012
bellatoris wants to merge 16 commits intoenvoyproxy:mainfrom
bellatoris:doogie/redis-zone-aware-routing

Conversation

@bellatoris
Copy link
Copy Markdown
Contributor

Summary

Zone-aware routing reduces cross-zone network traffic and latency by preferring replicas in the same availability zone as the client. This is particularly valuable in cloud environments where cross-zone data transfer incurs additional costs.

New Configuration

redis_cluster.proto:

  // Enable zone discovery via INFO command
  google.protobuf.BoolValue enable_zone_discovery = 7;

redis_proxy.proto - New ReadPolicy values:

  - AZ_AFFINITY: Prefer same-zone replicas → any replica → primary
  - AZ_AFFINITY_REPLICAS_AND_PRIMARY: Prefer same-zone replicas → same-zone primary → any replica → primary

How It Works

  ┌─────────────────────────────────────────────────────────┐
  │              CLUSTER SLOTS Response                     │
  └─────────────────────────────────────────────────────────┘
                           │
                           ▼
  ┌─────────────────────────────────────────────────────────┐
  │     Zone Discovery (if enable_zone_discovery=true)      │
  │     Send INFO to each node → parse availability_zone    │
  └─────────────────────────────────────────────────────────┘
                           │
                           ▼
  ┌─────────────────────────────────────────────────────────┐
  │     Create hosts with zone in locality.zone()           │
  │     RedisShard groups replicas by zone                  │
  └─────────────────────────────────────────────────────────┘
                           │
                           ▼
  ┌─────────────────────────────────────────────────────────┐
  │     Request Routing                                     │
  │     Client zone from node.locality.zone (bootstrap)     │
  │     AZ_AFFINITY: local replicas → any replica → primary │
  └─────────────────────────────────────────────────────────┘

Limitations

  • Valkey only: Zone discovery relies on availability_zone field in INFO response, which is exposed by Valkey but not standard Redis.
  • Client zone must be configured via node.locality.zone in Envoy bootstrap config.

Risk Level: Low

This is an opt-in feature that requires explicit configuration (enable_zone_discovery: true and read_policy: AZ_AFFINITY). Existing deployments are unaffected.

Testing

  • Added comprehensive unit tests for zone-aware load balancing

Related Issues

Closes #43011

@repokitteh-read-only
Copy link
Copy Markdown

Hi @bellatoris, welcome and thank you for your contribution.

We will try to review your Pull Request as quickly as possible.

In the meantime, please take a look at the contribution guidelines if you have not done so already.

🐱

Caused by: #43012 was opened by bellatoris.

see: more, trace.

@repokitteh-read-only
Copy link
Copy Markdown

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @markdroth
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #43012 was opened by bellatoris.

see: more, trace.

@bellatoris bellatoris force-pushed the doogie/redis-zone-aware-routing branch 2 times, most recently from 84f9f19 to b7b0088 Compare January 15, 2026 07:57
This change implements zone-aware routing for Redis Cluster, allowing read
requests to be routed to replicas in the same availability zone as the client.

Key changes:
- Add enable_zone_discovery config option to redis_cluster.proto
- Add az_affinity and az_affinity_replicas_and_primary read policies
- Implement INFO command-based zone discovery during cluster slot updates
- Store zone info in host locality for standard Envoy locality handling
- RedisShard groups replicas by zone for efficient zone-aware routing

Zone Discovery Flow:
1. CLUSTER SLOTS response triggers zone discovery when enabled
2. INFO command sent to each unique node to get availability_zone
3. Zones stored in host->locality().zone() when hosts are created
4. RedisShard reads zone from host locality, groups replicas by zone

Read Policies:
- AzAffinity: local replicas -> any replica -> primary
- AzAffinityReplicasAndPrimary: local replicas -> local primary -> any replica -> primary

Note: This feature currently works with Valkey only. Valkey exposes
availability_zone in its INFO response. Standard Redis does not support this field.

Signed-off-by: Doogie Min <doogie.min@sendbird.com>
@bellatoris bellatoris force-pushed the doogie/redis-zone-aware-routing branch from b7b0088 to 085d27f Compare January 15, 2026 07:59
@nezdolik nezdolik self-assigned this Jan 16, 2026
Copy link
Copy Markdown
Member

@nezdolik nezdolik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is partial review, will need to do few more rounds

RedisCluster::RedisHost::create(Upstream::ClusterInfoConstSharedPtr cluster,
const std::string& hostname,
Network::Address::InstanceConstSharedPtr address,
RedisCluster& parent, bool primary, const std::string& zone) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please reuse existing create method and add zone as absl::optional<string>

std::shared_ptr<const envoy::config::core::v3::Locality>
RedisCluster::RedisHost::makeLocalityWithZone(
const envoy::config::core::v3::Locality& base_locality, const std::string& zone) {
auto locality = std::make_shared<envoy::config::core::v3::Locality>(base_locality);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should locality be created if zone is empty? if not then you can return absl::StatusOr<std::shared_ptr<const envoy::config::core::v3::Locality>>


// Collect unique node addresses
absl::flat_hash_set<std::string> unique_addresses;
absl::flat_hash_map<std::string, bool> address_is_primary; // track if address is primary
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would not it be sufficient just to have this map and drop unique_addresses set? since address_is_primary map will also only contain unique addresses (as keys)

static_cast<int>(value->type()));
}

uint32_t remaining = pending_zone_requests_.fetch_sub(1) - 1;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to both atomically decrement pending_zone_requests_ and do - 1 ?

// availability_zone:us-east-1a
// ...
// Note: Standard Redis does not expose availability_zone.
std::string
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add const

@nezdolik
Copy link
Copy Markdown
Member

And looks like tests need to be added/improved, as coverage check is failing:

FAILED: Directories not meeting coverage thresholds:
  ✗ source/extensions/clusters/redis: 84.7% (threshold: 96.6%)
  ✗ source/extensions/filters/network/common/redis: 96.4% (threshold: 96.6%)

Rename the read policy enum values for clarity:
- AZ_AFFINITY -> LOCAL_ZONE_AFFINITY
- AZ_AFFINITY_REPLICAS_AND_PRIMARY -> LOCAL_ZONE_AFFINITY_REPLICAS_AND_PRIMARY

Also includes minor code cleanup:
- Remove redundant unique_addresses set (use address_is_primary map keys instead)
- Use try_emplace instead of find+insert pattern
- Use structured bindings for map iteration
- Add clarifying comments for fetch_sub return value semantics
- Use static constexpr for string literal constant

Signed-off-by: Doogie Min <doogie.min@sendbird.com>
@markdroth
Copy link
Copy Markdown
Contributor

/lgtm api

@bellatoris bellatoris force-pushed the doogie/redis-zone-aware-routing branch 9 times, most recently from fd233b4 to 2b3aa82 Compare January 29, 2026 13:11
Add tests to improve code coverage for the Redis cluster zone discovery
feature and new read policies:

- Add parseAvailabilityZone tests for various INFO response formats
- Add ZoneDiscoveryConfig test fixture for zone discovery testing
- Add tests for LOCAL_ZONE_AFFINITY and LOCAL_ZONE_AFFINITY_REPLICAS_AND_PRIMARY
  read policies in client_impl_test
- Add friend declaration to RedisDiscoverySession for test access

These tests cover previously untested code paths in redis_cluster.cc and
client_impl.cc to help meet coverage thresholds.

Signed-off-by: Doogie Min <doogie.min@sendbird.com>
@bellatoris bellatoris force-pushed the doogie/redis-zone-aware-routing branch 2 times, most recently from 2db9fa9 to 186289b Compare January 29, 2026 15:39
@nezdolik
Copy link
Copy Markdown
Member

/wait

bellatoris and others added 2 commits March 17, 2026 09:10
…alidation

Add tests to improve coverage for source/extensions/clusters/redis/ from
96.1% to ~97.2% (threshold: 96.6%):

- LoadBalancerStubMethods: exercises peekAnotherHost, selectExistingConnection,
  and lifetimeCallbacks stub methods on RedisClusterLoadBalancer (+9 lines)
- RedisErrorResponsePortOverflow: validates that CLUSTER SLOTS responses with
  port > 0xffff are rejected by validateCluster (+2 lines)

Signed-off-by: Doogie Min <doogie.min@sendbird.com>
@bellatoris
Copy link
Copy Markdown
Contributor Author

@nezdolik done

…ware-routing

# Conflicts:
#	changelogs/current.yaml
@bellatoris bellatoris requested a review from markdroth March 17, 2026 04:51
Copy link
Copy Markdown
Member

@nezdolik nezdolik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall lgtm, few nits while we wait for api review

absl::Status creation_status = absl::OkStatus();
auto ret = std::unique_ptr<RedisCluster::RedisHost>(
new RedisCluster::RedisHost(cluster, hostname, address, parent, primary, creation_status));
auto ret = std::unique_ptr<RedisCluster::RedisHost>(new RedisCluster::RedisHost(
Copy link
Copy Markdown
Member

@nezdolik nezdolik Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while we are waiting for api review, could we switch std::unique_ptr<RedisCluster::RedisHost>(new RedisCluster::RedisHost( to std::make_unique<RedisCluster::RedisHost>?

const std::string& primaryZone() const { return primary_zone_; }

// Get replicas in a specific zone. Returns nullptr if no replicas in that zone.
const Upstream::HostSetImpl* replicasInZone(const std::string& zone) const {
Copy link
Copy Markdown
Member

@nezdolik nezdolik Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: return unique_ptr here instead of raw pointer

bellatoris added a commit to bellatoris/envoy that referenced this pull request Mar 20, 2026
- Switch to std::make_unique in RedisHost::create() by making constructor public
- Return const unique_ptr& instead of raw pointer from replicasInZone()

Signed-off-by: Doogie Min <doogie.min@sendbird.com>
- Switch to std::make_unique in RedisHost::create() by making constructor public
- Return const unique_ptr& instead of raw pointer from replicasInZone()

Signed-off-by: Doogie Min <doogie.min@sendbird.com>
@bellatoris bellatoris force-pushed the doogie/redis-zone-aware-routing branch from 5cd4e9b to 13f2bef Compare March 20, 2026 17:39
@bellatoris
Copy link
Copy Markdown
Contributor Author

overall lgtm, few nits while we wait for api review

Done, could you review again @nezdolik @markdroth ?

@bellatoris bellatoris requested a review from nezdolik March 20, 2026 18:02
@nezdolik
Copy link
Copy Markdown
Member

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces zone-aware routing for the Redis Cluster proxy, a valuable feature for reducing cross-zone traffic. The changes are well-structured, touching the API, core discovery and load balancing logic, and connection pooling. The implementation includes a new zone discovery mechanism using the INFO command, which is cleanly integrated into the existing cluster discovery session. The load balancing logic is extended to support two new read policies with clear fallback semantics. The addition of comprehensive unit and integration tests covering both the discovery and load balancing aspects is commendable. I found one high-severity memory safety issue that should be addressed.

Comment on lines +277 to +279
void onZoneResponse(const std::string& address, bool is_primary,
NetworkFilters::Common::Redis::RespValuePtr&& value);
void onZoneDiscoveryFailure(const std::string& address, bool is_primary);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There is a potential use-after-free bug in the implementation of onZoneResponse and onZoneDiscoveryFailure. These methods are called with a const std::string& that refers to a member of a ZoneDiscoveryCallback object. Inside these methods, zone_callbacks_.erase(address) is called, which destroys the callback object and invalidates the address reference. Using this dangling reference as the key for the erase operation is undefined behavior.

A comment in onZoneResponse indicates awareness of this lifetime issue, but the implementation is still susceptible to this bug.

To fix this, you should pass the address by value to ensure the string's lifetime is extended until the function returns. The call sites in ZoneDiscoveryCallback will then create a copy, which is the desired behavior.

Suggested change
void onZoneResponse(const std::string& address, bool is_primary,
NetworkFilters::Common::Redis::RespValuePtr&& value);
void onZoneDiscoveryFailure(const std::string& address, bool is_primary);
void onZoneResponse(std::string address, bool is_primary,
NetworkFilters::Common::Redis::RespValuePtr&&& value);
void onZoneDiscoveryFailure(std::string address, bool is_primary);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed

@nezdolik
Copy link
Copy Markdown
Member

@markdroth ptal.

…ing hazard

onZoneResponse and onZoneDiscoveryFailure receive address as const ref
aliasing ZoneDiscoveryCallback::address_. When zone_callbacks_.erase(address)
destroys the callback, the reference becomes dangling during the erase call.
Pass by value to ensure the string outlives the erase.

Signed-off-by: Doogie Min <doogie.min@sendbird.com>
Signed-off-by: Doogie Min <doogie.min@sendbird.com>
@bellatoris bellatoris force-pushed the doogie/redis-zone-aware-routing branch from 448cfc5 to 94f0de7 Compare March 24, 2026 02:19
Signed-off-by: Doogie Min <doogie.min@sendbird.com>
@bellatoris
Copy link
Copy Markdown
Contributor Author

/retest

@nezdolik
Copy link
Copy Markdown
Member

@wbpcode could you do an api shepherd review?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

redis: Support zone-aware routing support for Redis Cluster proxy

4 participants