Skip to content

Zone Aware Routing support#5352

Merged
arkodg merged 44 commits intoenvoyproxy:mainfrom
jukie:zone-aware-routing-init-container
Apr 27, 2025
Merged

Zone Aware Routing support#5352
arkodg merged 44 commits intoenvoyproxy:mainfrom
jukie:zone-aware-routing-init-container

Conversation

@jukie
Copy link
Copy Markdown
Contributor

@jukie jukie commented Feb 26, 2025

What type of PR is this?

What this PR does / why we need it:

This is the second half to #5299 and adds support for Kubernetes Topology Aware Routing and Traffic Distribution via Envoy Zone Aware Routing.

  • Adds a MutatingWebhookConfiguration which reacts to proxy pod Binding resource creations and injects underlying node topology information back to the pod.
  • When Traffic Distribution or Topology Aware Routing are enabled for a Kubernetes Service, EnvoyGateway will inspect the endpoints and include zone information to LocalityLbEndpoints. EnvoyGateway will also then set the LocalityLbConfig to use Zone Aware Routing.

Which issue(s) this PR fixes:

Fixes #1909

Release Notes: Yes

@codecov
Copy link
Copy Markdown

codecov bot commented Feb 26, 2025

Codecov Report

Attention: Patch coverage is 69.54545% with 67 lines in your changes missing coverage. Please review.

Project coverage is 65.33%. Comparing base (096cb8d) to head (b677236).
Report is 43 commits behind head on main.

Files with missing lines Patch % Lines
internal/provider/kubernetes/topology_injector.go 42.85% 23 Missing and 9 partials ⚠️
internal/provider/kubernetes/kubernetes.go 0.00% 14 Missing and 2 partials ⚠️
internal/cmd/certgen.go 52.00% 9 Missing and 3 partials ⚠️
internal/xds/translator/cluster.go 96.39% 3 Missing and 1 partial ⚠️
internal/gatewayapi/route.go 0.00% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5352      +/-   ##
==========================================
+ Coverage   65.19%   65.33%   +0.13%     
==========================================
  Files         214      220       +6     
  Lines       34321    35180     +859     
==========================================
+ Hits        22377    22986     +609     
- Misses      10591    10777     +186     
- Partials     1353     1417      +64     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@jukie jukie mentioned this pull request Mar 3, 2025
@jukie jukie force-pushed the zone-aware-routing-init-container branch from 0529343 to 61f063a Compare March 14, 2025 01:08
@jukie
Copy link
Copy Markdown
Contributor Author

jukie commented Mar 14, 2025

Still WIP but this example "works". A challenge is the way that Envoy internally tries to keep an overall equal request level between other Proxy instances so I think we'll need to add the EnvoyProxy service as an EDS cluster in bootstrap and then have the endpoints populated afterwards.

@AlexKrudu
Copy link
Copy Markdown
Contributor

AlexKrudu commented Mar 28, 2025

@jukie Hi! I am considering using this feature for my edge-proxy envoy setup and I was wondering if you could give me some clarifications on how this feature's intent and perhaps some thoughts on your implementation.
As far as I understand envoy developers intent - this feature is meant to be tracking ratio of both the callee and caller service (clusters) local workload to the total workload. And the result would be somewhat equal request rate to the all of upstream service instances in all of the localities. Your implementation considers Envoy itself as a originating cluster with the static size of 1. Isn't that the reason that this setup doesn't achieve the goal of leaving of requests in the local zone? Because from envoy's point of view local zone is the only zone requests are coming from and that's why it would try to spread it equally between all of the zones. You've mentioned adding the EnvoyProxy service as an EDS cluster in bootstrap. Would you mind elaborating on how that would tackle the described problem? And thank you for you work!

@AlexKrudu
Copy link
Copy Markdown
Contributor

AlexKrudu commented Mar 28, 2025

Because for now my thoughts are that zone-aware routing does not quite align with the edge proxy setup :)

@arkodg
Copy link
Copy Markdown
Contributor

arkodg commented Mar 28, 2025

@AlexKrudu , @jukie is working in upstream to change the behavior for ingress envoyproxy/envoy#38756, which can be leveraged here

@AlexKrudu
Copy link
Copy Markdown
Contributor

Thanks @arkodg! But with that force_locality_direct_routing feature enabled, wouldn't the zone-aware rouing feature equivalent to just setting priority P(0) for the endpoints in the same zone as the envoy? And P(1) for the rest for instance

@arkodg
Copy link
Copy Markdown
Contributor

arkodg commented Mar 28, 2025

@AlexKrudu we're trying to achieve zone precedence/priority w/o explicitly using the priority feature, and adding another level of precedence within priority (0)

you could achieve the same with priority, but you'd have to build a different CDS + EDS for every envoy in every zone which is non trivial + requires more resources like memory

@jukie jukie force-pushed the zone-aware-routing-init-container branch 3 times, most recently from c819e9e to 10280ec Compare April 6, 2025 22:55
@jukie
Copy link
Copy Markdown
Contributor Author

jukie commented Apr 7, 2025

/retest

@jukie
Copy link
Copy Markdown
Contributor Author

jukie commented Apr 8, 2025

Is there any preference for handling EnvoyProxy RBAC? Will need permissions to get nodes in the cluster for zone discovery

  • A: Add Role/RoleBinding management permissions to EG
  • B: All EnvoyProxy deployments use the same predefined ServiceAccount and RBAC is setup via helm
  • C: Predefined Role/RoleBinding via helm chart with a ref target of some-sa-prefix-*

@arkodg
Copy link
Copy Markdown
Contributor

arkodg commented Apr 11, 2025

hey @jukie vote for B
we already create an SA per fleet

func (r *ResourceRender) ServiceAccount() (*corev1.ServiceAccount, error) {

we'll need to also create a Role and RoleBinding here going forward if zone aware routing is enabled in EnvoyProxy

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>
@jukie jukie force-pushed the zone-aware-routing-init-container branch from bfe574f to 263a7ba Compare April 16, 2025 20:27
jukie added 2 commits April 16, 2025 15:04
Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>
Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>
@jukie jukie changed the title tmp-start Zone Aware Routing support Apr 16, 2025
@jukie jukie marked this pull request as ready for review April 16, 2025 22:06
@jukie jukie requested a review from a team as a code owner April 16, 2025 22:06
@jukie jukie requested a review from arkodg April 16, 2025 22:10
@jukie
Copy link
Copy Markdown
Contributor Author

jukie commented Apr 16, 2025

Hey @arkodg, this is ready for review now!

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>
@jukie
Copy link
Copy Markdown
Contributor Author

jukie commented Apr 26, 2025

/retest

@jukie
Copy link
Copy Markdown
Contributor Author

jukie commented Apr 26, 2025

/retest

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>
@jukie jukie force-pushed the zone-aware-routing-init-container branch from 77be171 to f182175 Compare April 26, 2025 17:40
@jukie
Copy link
Copy Markdown
Contributor Author

jukie commented Apr 26, 2025

Because I've added multiple nodes to the E2E tests the UDS pods need to include affinity for the proxy pods

@jukie jukie requested a review from arkodg April 26, 2025 17:41
@jukie
Copy link
Copy Markdown
Contributor Author

jukie commented Apr 26, 2025

/retest

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>
@jukie
Copy link
Copy Markdown
Contributor Author

jukie commented Apr 27, 2025

/retest

@jukie
Copy link
Copy Markdown
Contributor Author

jukie commented Apr 27, 2025

/retest

}

rawPatch := client.RawPatch(types.JSONPatchType, []byte(patch))
if err := m.Patch(ctx, pod, rawPatch); err != nil {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be better to add some webhook metrics in the following PR.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding metrics over here - #5835

Copy link
Copy Markdown
Member

@zirain zirain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gernally LGTM.

zirain
zirain previously approved these changes Apr 27, 2025
jukie and others added 2 commits April 26, 2025 23:20
Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>
jukie and others added 2 commits April 27, 2025 01:09
Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>
@jukie
Copy link
Copy Markdown
Contributor Author

jukie commented Apr 27, 2025

/retest

@arkodg arkodg merged commit 49a0b22 into envoyproxy:main Apr 27, 2025
55 of 60 checks passed
@jukie jukie deleted the zone-aware-routing-init-container branch April 27, 2025 15:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Locality Based Routing Support

4 participants