Skip to content

feat: switch to sidecar+UDS extproc#629

Merged
yuzisun merged 42 commits intomainfrom
sidecarstyle
May 28, 2025
Merged

feat: switch to sidecar+UDS extproc#629
yuzisun merged 42 commits intomainfrom
sidecarstyle

Conversation

@mathetake
Copy link
Copy Markdown
Member

@mathetake mathetake commented May 21, 2025

Commit Message

This commit refactors the internal on how the ext proc is deployed. Specifically, this switches to insert the ext proc container as a sidecar container of Envoy pods created by Envoy Gateway. This is another large refactoring that turned out necessary for #599. This utilizes the mutating webhook to insert the extproc container Envoy pods.

Making the extproc as as sidecar means that we now have a one-to-one mapping between Gateway and the extproc hence this naturally resolves the previously known limitation #509 and now users can attach multiple AIGatewayRoute(s) to one Gateway.

Implementation note: since the volume mounts only work in the namespace-scoped way, use-created secrets (like API Keys) cannot be mounted by the extproc as it runs in "envoy-gateway-system" namespace. To resolve this, now the controller reads the secret and embed the read credentials into the "extproc secret" (which is previously known as "extproc configmap") together with routing, matching and backend information. That secret is written in the "envoy-gateway-system" namespace hence it can be mounted by the extproc container.

Related Issues/PRs (if applicable)

Resolves #509
Resolves #621

mathetake added a commit that referenced this pull request May 21, 2025
**Commit Message**

This removes the inference extension related code from the controller to
reduce the size of the refactoring PR #629. We need to do the complete
redo on inference extension after #629, so this doesn't mean that we
drop the support for it.

---------

Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
mathetake added 2 commits May 21, 2025 16:13
Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
@mathetake
Copy link
Copy Markdown
Member Author

ok all good except that i need to backfill the unit tests on the new gateway reconciler and gateway pod mutators. other than that, ready

mathetake added 3 commits May 21, 2025 19:49
Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
@mathetake mathetake marked this pull request as ready for review May 22, 2025 03:39
@mathetake mathetake requested a review from a team as a code owner May 22, 2025 03:39
Comment on lines +320 to +323
// Note: when multiple AIGatewayRoute resources are attached to the same Gateway, and each
// AIGatewayRoute has a different resource configuration, the ai-gateway will pick one of them
// to configure the resource requirements of the external processor container.
//
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i want some ideas on this - maybe we might want a "gateway-level" CRD to define things like this

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can add the resource requirement to the ai gateway controller flags.

@mathetake
Copy link
Copy Markdown
Member Author

i think this lacks lots of code comments compared to the usual changes i did ...

@mathetake
Copy link
Copy Markdown
Member Author

i think i need to add one additional end to end test where we apply multiple AIGatewayRoutes

mathetake added 4 commits May 22, 2025 10:52
Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
Copy link
Copy Markdown
Contributor

@aabchoo aabchoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments still reviewing

Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
@mathetake
Copy link
Copy Markdown
Member Author

i will refactor around how certs are created etc tomorrow, following a nice advice by @arkodg

Copy link
Copy Markdown
Contributor

@aabchoo aabchoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re:CMD


var slogLevel slog.Level
if err = slogLevel.UnmarshalText([]byte(*extProcLogLevelPtr)); err != nil {
if err := slogLevel.UnmarshalText([]byte(*extProcLogLevelPtr)); err != nil {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious: why use := when err is defined a few lines above?

// PostVirtualHostModify allows an extension to modify the virtual hosts in the xDS config.
//
// Currently, this replaces the route that has "x-ai-eg-selected-route" pointing to "original_destination_cluster" to route to the original destination cluster.
func (s *Server) PostVirtualHostModify(_ context.Context, req *egextension.PostVirtualHostModifyRequest) (*egextension.PostVirtualHostModifyResponse, error) {
Copy link
Copy Markdown
Member Author

@mathetake mathetake May 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i should've left note: this code here was not used at all after InfExt controller code removal #632, and I didn't want to fix the test for this unused code hence I removed

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// httpHeaderName: x-ai-eg-original-dst
// useHttpHeader: true
// type: ORIGINAL_DST
if !extProcUDSExist {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: this uds cluster creation was necessary to attach the UDS cluster to the upstream filter below since now the extension policy was created per-gateway, not per-route, so we needed to create a unique name UDS cluster that can be attached below

mathetake added 2 commits May 27, 2025 08:13
Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
@mathetake
Copy link
Copy Markdown
Member Author

ok it seems e2e is flaky... i need to fix

mathetake added 3 commits May 27, 2025 08:33
Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
@mathetake
Copy link
Copy Markdown
Member Author

hmm still flaky...

mathetake added 2 commits May 27, 2025 09:20
Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
@mathetake
Copy link
Copy Markdown
Member Author

@yuzisun @wengyao04 PTAL

Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
@mathetake
Copy link
Copy Markdown
Member Author

man flake

mathetake added 9 commits May 27, 2025 10:03
Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
# Conflicts:
#	internal/controller/ai_gateway_route.go
#	internal/controller/ai_gateway_route_test.go
Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
@yuzisun yuzisun merged commit 8d9c8e0 into main May 28, 2025
17 checks passed
@yuzisun yuzisun deleted the sidecarstyle branch May 28, 2025 17:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Request times out with the upstream filter when running multiple extproc instances Suport for multiple AIGatewayRoutes per Gateway

5 participants