
Add toIP to CiliumLocalRedirectPolicy redirectBackend #41645

Draft

liyihuang wants to merge 7 commits into cilium:main from liyihuang:pr/liyihuang/lrp_ip_override

Conversation

@liyihuang
Contributor

@liyihuang liyihuang commented Sep 13, 2025

see commit message

Fixes: #41671

Add toIP to CiliumLocalRedirectPolicy redirectBackend so it works with latest GKE metadata server

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Sep 13, 2025
@liyihuang liyihuang force-pushed the pr/liyihuang/lrp_ip_override branch from 745c66e to 1c79e79 Compare September 16, 2025 22:59
@liyihuang liyihuang added the release-note/minor This PR changes functionality that users may find relevant to operating Cilium. label Sep 16, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Sep 16, 2025
@liyihuang liyihuang added dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. area/lrp Impacts Local Redirect Policy. labels Sep 16, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Sep 16, 2025
@liyihuang liyihuang changed the title Pr/liyihuang/lrp ip override Add overrideIP to CiliumLocalRedirectPolicy redirectBackend Sep 16, 2025
@liyihuang
Contributor Author

/test

@liyihuang liyihuang force-pushed the pr/liyihuang/lrp_ip_override branch from 1c79e79 to 3fc4c9e Compare September 17, 2025 01:49
@liyihuang
Contributor Author

/test

@liyihuang liyihuang force-pushed the pr/liyihuang/lrp_ip_override branch from 3fc4c9e to 44669bf Compare September 17, 2025 02:46
@liyihuang
Contributor Author

/test

@liyihuang liyihuang marked this pull request as ready for review September 17, 2025 11:58
@liyihuang liyihuang requested review from a team as code owners September 17, 2025 11:58
@liyihuang
Contributor Author

Putting this back to draft since I think the test failure is not flaky CI.

@liyihuang liyihuang marked this pull request as draft September 17, 2025 14:52
Contributor

@joamaki joamaki left a comment


Nice!

@liyihuang
Contributor Author

liyihuang commented Sep 18, 2025

I tried to reproduce it locally but the issue is gone on my laptop. It looks like I'm hitting
#31468

I'm also concerned that this PR could trigger #31468 more often.

@liyihuang liyihuang marked this pull request as ready for review September 18, 2025 20:50
Member

@qmonnet qmonnet left a comment


Thanks! Here are some small doc nits from my side

@liyihuang liyihuang force-pushed the pr/liyihuang/lrp_ip_override branch from 44669bf to 4f9e02a Compare September 19, 2025 02:32
@liyihuang liyihuang force-pushed the pr/liyihuang/lrp_ip_override branch from 0c915b3 to 28d6b2a Compare January 27, 2026 19:42
@liyihuang
Contributor Author

liyihuang commented Jan 27, 2026

I just tested the GKE use case with this config

localRedirectPolicies:
  enabled: true
  toIPRange: "1.1.1.0/24"

and LRP config

apiVersion: cilium.io/v2
kind: CiliumLocalRedirectPolicy
metadata:
  name: gke-metadata-server-redirect-80
  namespace: kube-system
spec:
  redirectBackend:
    localEndpointSelector:
      matchLabels:
        k8s-app: "gke-metadata-server"
    toIP: "169.254.169.252"
    toPorts:
    - port: "988"
      protocol: "TCP"
  redirectFrontend:
    addressMatcher:
      ip: "169.254.169.254"
      toPorts:
      - port: "80"
        protocol: "TCP"

we get the following warning rejecting the LRP config

time=2026-01-27T19:02:23.644462797Z level=warn msg="Rejecting malformed CiliumLocalRedirectPolicy" module=agent.controlplane.local-redirect-policies k8sNamespace=kube-system name=gke-metadata-server-redirect-80 error="ToIP 169.254.169.252 is not within the allowed range 1.1.1.0/24"
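The rejection above is a plain CIDR containment check between the policy's toIP and the configured lrp-to-ip-range. A minimal Go sketch of that check (the function name and error text are illustrative, not Cilium's actual implementation):

```go
package main

import (
	"fmt"
	"net/netip"
)

// validateToIP checks that a policy's toIP falls inside the allowed range,
// mirroring the lrp-to-ip-range validation described above.
// Names here are illustrative, not Cilium's actual code.
func validateToIP(toIP, allowedRange string) error {
	ip, err := netip.ParseAddr(toIP)
	if err != nil {
		return fmt.Errorf("invalid ToIP %q: %w", toIP, err)
	}
	prefix, err := netip.ParsePrefix(allowedRange)
	if err != nil {
		return fmt.Errorf("invalid range %q: %w", allowedRange, err)
	}
	if !prefix.Contains(ip) {
		return fmt.Errorf("ToIP %s is not within the allowed range %s", ip, prefix)
	}
	return nil
}

func main() {
	// The GKE metadata server IP is outside 1.1.1.0/24, so this reproduces
	// the rejection in the log above.
	fmt.Println(validateToIP("169.254.169.252", "1.1.1.0/24"))
	// An in-range IP passes.
	fmt.Println(validateToIP("1.1.1.5", "1.1.1.0/24"))
}
```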

At the end of the day, people need to manage who can access the API or config; otherwise, people can always override the Helm values to access whatever they want.

To me this is tighter from the LRP perspective but makes no difference from the cluster perspective, since people can always control what is reachable via network policy.

Add a new Helm value `localRedirectPolicies.toIPRange` to configure the allowed IP range for the ToIP field in Local Redirect Policies. The configuration maps to the `lrp-to-ip-range` flag in the Cilium ConfigMap.

Signed-off-by: Liyi Huang <liyi.huang@isovalent.com>
@liyihuang liyihuang force-pushed the pr/liyihuang/lrp_ip_override branch from 28d6b2a to 4f9f5e8 Compare January 27, 2026 20:55
@liyihuang
Contributor Author

/test

@joestringer
Member

FYI this topic came up during the community meeting today, so @jrife , @bowei and I briefly discussed it. The recording is here (starts around 26m51s). Core question there is whether we want to extend this existing feature in a way that is powerful but increases the likelihood of the feature to cause breakage of other features (+ corresponding support burden), or whether we try to improve the abstraction of traffic handling for this use case "service with a local cache". We didn't come to any conclusions, probably this needs a bit more ongoing discussion with folks interested in the topic (at least @ysksuzuki , @liyihuang , @Bigdelle also expressed interest).

@liyihuang
Contributor Author

I’ve watched the recording and here are my thoughts on LRP.

Looking at the LRP use cases in the documentation (https://docs.cilium.io/en/latest/network/kubernetes/local-redirect-policy/), the most common example seems to be local DNS. When I browse the LRP codebase, I also notice that most of the documentation and implementation haven’t been touched in the last ~5 years, aside from the recent control-plane rewrite.

This makes me think that local caching is the primary (and perhaps only widely used) use case for LRP, and that LRP itself is not broadly adopted in the same way as core features like NetworkPolicy.

If that’s the reality, then changing LRP to be cluster-wide, or introducing a new API for a cluster-wide LRP, feels like overkill. If we assume that the main users of LRP are platform engineers, then I’d argue that platform engineers already have plenty of “footguns” that can break a cluster — this wouldn’t be fundamentally different.

Personally, I think toIPRange is more than sufficient for the use cases discussed, and it could even be removed. Platform engineers can already use CiliumNetworkPolicy (CNP) to block or control traffic when needed.

@joestringer
Member

I'd like to see some consideration given to what the ideal expression of these node-local caches should look like, and what Cilium's role in that should be. I understand that a node-local cache should be an opt-in feature. The cluster admin needs to deploy it anyway. If there needs to be some way to express to Cilium that there is such a cache and Cilium as the LB needs to behave differently, then that's fine. That's in essence what LRP does today. LRP with Service matchers and Pod matchers provides a very k8s-native way to implement the functionality we need. The question is what happens when we don't have a convenient handle in a k8s resource that expresses "This node-local cache exists on IP:PORT". Can we express that in a more canonical way for cases like this GKE metadata service? That is, natively express it in k8s resources while keeping it managed by cluster admin RBAC? If we can avoid expanding the arbitrary packet field matching functionality in LRP, I think that is a more beneficial path forward given the outstanding issues with addressMatcher.

@ysksuzuki
Member

This makes me think that local caching is the primary (and perhaps only widely used) use case for LRP, and that LRP itself is not broadly adopted in the same way as core features like NetworkPolicy.

AFAIK, we don't have data on how widely LRP is used, so I'm not sure about it. But even if LRP does have a relatively smaller user base compared to core features, I don't think usage alone should be a factor in deciding how strict the design constraints should be.

If we assume that the main users of LRP are platform engineers, then I'd argue that platform engineers already have plenty of “footguns” that can break a cluster — this wouldn't be fundamentally different.

I agree that platform engineers operate with powerful privileges, but that’s exactly why misconfigurations can have a much larger and more severe impact. Because of that, I think it’s important for APIs like LRP to be designed in a way that minimizes the risk of accidental misconfiguration and provides clear guardrails, rather than assuming that existing “footguns” make additional ones acceptable.


I'd like to clarify the intended semantics of redirectBackend.toIP in this PR.

If a user specifies an in-cluster endpoint (for example, a Service ClusterIP or a Pod IP) as toIP, does the policy allow redirecting traffic to that destination? If so, this seems to break the existing documented constraint that “The namespace of backend pod(s) need to match with that of the policy.” Could you explain why breaking that constraint is considered acceptable or not problematic?

If toIP is not allowed to point to arbitrary in-cluster IPs, then what destinations are allowed? Please explicitly define what ranges or categories of IPs are considered valid targets for toIP.

Also, how is toIP validated against the backend selected by redirectBackend.localEndpointSelector? Is there a mechanism to verify that the specified toIP is actually owned by (or legitimately associated with) the selected backend pod(s)? If there is no such verification, what is the rationale for requiring localEndpointSelector at all?
In that case, it seems the selector could be satisfied by a dummy pod while toIP redirects traffic elsewhere, which would make localEndpointSelector largely ineffective as a scoping mechanism.

@liyihuang
Contributor Author

This makes me think that local caching is the primary (and perhaps only widely used) use case for LRP, and that LRP itself is not broadly adopted in the same way as core features like NetworkPolicy.

AFAIK, we don't have data on how widely LRP is used, so I'm not sure about it. But even if LRP does have a relatively smaller user base compared to core features, I don't think usage alone should be a factor in deciding how strict the design constraints should be.

If we assume that the main users of LRP are platform engineers, then I'd argue that platform engineers already have plenty of “footguns” that can break a cluster — this wouldn't be fundamentally different.

I agree that platform engineers operate with powerful privileges, but that’s exactly why misconfigurations can have a much larger and more severe impact. Because of that, I think it’s important for APIs like LRP to be designed in a way that minimizes the risk of accidental misconfiguration and provides clear guardrails, rather than assuming that existing “footguns” make additional ones acceptable.

I'd like to clarify the intended semantics of redirectBackend.toIP in this PR.

If a user specifies an in-cluster endpoint (for example, a Service ClusterIP or a Pod IP) as toIP, does the policy allow redirecting traffic to that destination? If so, this seems to break the existing documented constraint that “The namespace of backend pod(s) need to match with that of the policy.” Could you explain why breaking that constraint is considered acceptable or not problematic?

If toIP is not allowed to point to arbitrary in-cluster IPs, then what destinations are allowed? Please explicitly define what ranges or categories of IPs are considered valid targets for toIP.

Also, how is toIP validated against the backend selected by redirectBackend.localEndpointSelector? Is there a mechanism to verify that the specified toIP is actually owned by (or legitimately associated with) the selected backend pod(s)? If there is no such verification, what is the rationale for requiring localEndpointSelector at all? In that case, it seems the selector could be satisfied by a dummy pod while toIP redirects traffic elsewhere, which would make localEndpointSelector largely ineffective as a scoping mechanism.

Thanks for reminding me about that.

I overall agree that we shouldn't give users another footgun just because they already have some. Yes, the current implementation will break the constraint you mentioned there.

I think a more proper way to validate the GKE/cache use case is through Cilium-internal info.

Here are my thoughts:

  • we check whether the backend pod is using hostNetwork.
  • if the backend pod is on hostNetwork, we use node-addresses or devices to see whether toIP is one of those IP addresses.
  • only if toIP is among those do we override it.

In this way, we limit toIP to cases where the backend pod is on hostNetwork and toIP is configured in the host network namespace, which is the same network namespace as the cilium-agent; a normal pod should use the IP from its status anyway.

@joestringer @ysksuzuki Please let me know what you think about it. If we all agree, I can drop my previous manual validation and implement this method.

@joestringer
Member

  • (propose to) we check whether the backend pod is using hostNetwork.
  • if the backend pod is on hostNetwork, we use node-addresses or devices to see whether toIP is one of those IP addresses.
  • only if toIP is among those do we override it.

In this way, we limit toIP to cases where the backend pod is on hostNetwork and toIP is configured in the host network namespace, which is the same network namespace as the cilium-agent; a normal pod should use the IP from its status anyway.

Do you think we could either infer the right IP or get the user to explicitly advertise the IP to be used as part of their hostNetwork daemonset then skip the toIP match in the LRP altogether in favor of a backend selector based on labels? In the end it feels like the user is already controlling the intent around which IPs are used for this node-local version of the service when they deploy the app. So then why do we need the user to duplicate this information in LRP and ensure it lines up?

I guess the short answer is this clarifies which IP to use if the IP is ambiguous, but then I feel like it's better to set up a clear contract with the user that they should explicitly declare that IP as metadata in the deployment/daemonset rather than configuring it in multiple places and trying to make it line up.

On one level this feels somewhat of a trivial question of whether to host the "source of truth" information (which IP) in one resource or two, but on the other hand if we can make the interfaces clearer and reduce the likelihood of future incompatibilities / footguns by avoiding this extra configuration altogether, then that feels like it might be ultimately simpler to operate & maintain.

@liyihuang
Contributor Author

liyihuang commented Jan 29, 2026

Do you think we could either infer the right IP or get the user to explicitly advertise the IP to be used as part of their hostNetwork daemonset then skip the toIP match in the LRP altogether in favor of a backend selector based on labels? In the end it feels like the user is already controlling the intent around which IPs are used for this node-local version of the service when they deploy the app. So then why do we need the user to duplicate this information in LRP and ensure it lines up?

I believe it's case-by-case. If users have full control over the Deployment or DaemonSet, we can simply read the configuration.

However, in managed Kubernetes environments (like GKE), users often deploy components via a managed console where they don't have direct control over the manifests. This has recently become an issue because the GKE metadata server started listening on a specific IP address following a recent upgrade. They set up iptables rules for redirection, which breaks things for Cilium users who are utilizing eBPF host routing.

If there were a contract between Cilium and other providers (like GKE) to use specific annotations to indicate the preferred IP for these special setups, it would be much easier for both us and the users

@Bigdelle Is this possible?

@Bigdelle
Contributor

Do you think we could either infer the right IP or get the user to explicitly advertise the IP to be used as part of their hostNetwork daemonset then skip the toIP match in the LRP altogether in favor of a backend selector based on labels? In the end it feels like the user is already controlling the intent around which IPs are used for this node-local version of the service when they deploy the app. So then why do we need the user to duplicate this information in LRP and ensure it lines up?

I believe it's case-by-case. If users have full control over the Deployment or DaemonSet, we can simply read the configuration.

However, in managed Kubernetes environments (like GKE), users often deploy components via a managed console where they don't have direct control over the manifests. This has recently become an issue because the GKE metadata server started listening on a specific IP address following a recent upgrade. They set up iptables rules for redirection, which breaks things for Cilium users who are utilizing eBPF host routing.

If there were a contract between Cilium and other providers (like GKE) to use specific annotations to indicate the preferred IP for these special setups, it would be much easier for both us and the users

@Bigdelle Is this possible?

Generally, I think this approach makes sense. It sounds like it would keep CLRPs focused on their original intent and solve this GKE MDS issue. We will discuss this internally with the proper folks to see what we can land on. I will keep you updated, but this sounds promising and may work on our end.

@liyihuang
Contributor Author

I also would like to add some notes to make it clear: with an annotation, we would still use the same mechanism to verify ownership of the IP address, so people can't use it as a DNAT tool to arbitrary IP addresses.

So the following logic is still valid; we just don't change the frontend IP on the LRP, but use an annotation from the pod instead:

we check whether the backend pod is using hostNetwork.
if the backend pod is on hostNetwork, we use node-addresses or devices to see whether the annotation's IP is one of those addresses.
only if the annotation's IP is among those do we override it.

@liyihuang
Contributor Author

@Bigdelle any update on this?

@Bigdelle
Contributor

Bigdelle commented Feb 5, 2026

@Bigdelle any update on this?

I'll have an update soon. Thanks for checking in.

@Bigdelle
Contributor

Bigdelle commented Feb 11, 2026

@liyihuang
I had a chance to gather some feedback on using pod annotations. We generally understand the goal to continue to limit CLRP to a namespace-scoped feature and avoid using it as a general DNAT feature. That being said, we have some concerns about the blast radius and security surface area this introduces.

  1. Using annotations means we now have to restrict RBAC for two separate resources (the CLRP and the pod itself). For something like the GKE MDS, this increases the risk of misconfiguration, and adds another resource that must be synchronized and audited. We would be splitting trust between two resources instead of a single API.
  2. Relying on pod annotations creates a hidden dependency. It makes auditing the cluster harder because the actual redirect target isn't in the policy itself.

We ideally see just one resource (the CLRP) controlling the behavior of the redirection, rather than definitions split between the data plane and control plane, whether through direct fields in the API or annotations in the CLRP. On a related note, to limit the scope and surface of this feature, would it be possible to limit the override ability to just backend pods that have hostNetwork=true? This seems like a proactive way to try to limit the security concerns without compromising on the actual solution this provides. It makes sure that standard pods can't be used as arbitrary redirect targets while supporting the node-level infrastructure (like GKE MDS) that this feature is intended for.

@jrife
Contributor

jrife commented Feb 11, 2026

On a related note, to limit the scope and surface of this feature, would it be possible to limit the override ability to just backend pods that have hostNetwork=true?

You could probably use BPF socket redirect and lookup helpers to enforce this in the implementation: bpf_sk_lookup to see if there is a socket matching the IP in question then use bpf_sk_assign to direct traffic towards this socket.

@joestringer
Member

@Bigdelle FYI we're rethinking CLRP also over here: #44138 . Might be worth a perusal. I agree that managing multiple resources is a pain. That's why I'm interested to explore if we even need the LRP abstraction at all or whether we can conform service/NAT handling closer to the official k8s service objects.

@liyihuang
Contributor Author

@liyihuang I had a chance to gather some feedback on using pod annotations. We generally understand the goal to continue to limit CLRP to a namespace-scoped feature and avoid using it as a general DNAT feature. That being said, we have some concerns about the blast radius and security surface area this introduces.

  1. Using annotations means we now have to restrict RBAC for two separate resources (the CLRP and the pod itself). For something like the GKE MDS, this increases the risk of misconfiguration, and adds another resource that must be synchronized and audited. We would be splitting trust between two resources instead of a single API.
  2. Relying on pod annotations creates a hidden dependency. It makes auditing the cluster harder because the actual redirect target isn't in the policy itself.

We ideally see just one resource (the CLRP) controlling the behavior of the redirection, rather than definitions split between the data plane and control plane, whether through direct fields in the API or annotations in the CLRP. On a related note, to limit the scope and surface of this feature, would it be possible to limit the override ability to just backend pods that have hostNetwork=true? This seems like a proactive way to try to limit the security concerns without compromising on the actual solution this provides. It makes sure that standard pods can't be used as arbitrary redirect targets while supporting the node-level infrastructure (like GKE MDS) that this feature is intended for.

I'm aligned with this from this PR's perspective and agree that we should have an easy way for users to consume the API.

You could probably use BPF socket redirect and lookup helpers to enforce this in the implementation: bpf_sk_lookup to see if there is a socket matching the IP in question then use bpf_sk_assign to direct traffic towards this socket.

I personally think validating in the control plane makes more sense, since that helps us keep the data path simple.

@ysksuzuki
Member

ysksuzuki commented Mar 12, 2026

@liyihuang @Bigdelle We discussed this internally and wanted to ask: has using --exclude-local-address been considered as an alternative?

With #41275, excluded addresses fall back to the kernel stack instead of being handled by eBPF host routing. This would allow GKE's iptables DNAT rule (169.254.169.254 → 169.254.169.252:988) to work as-is, without needing LRP or any new API fields.

EDIT: Sorry for the confusion. Let me double-check whether --exclude-local-address is actually relevant here.

@Bigdelle
Contributor

@liyihuang @Bigdelle We discussed this internally and wanted to ask: has using --exclude-local-address been considered as an alternative?

With #41275, excluded addresses fall back to the kernel stack instead of being handled by eBPF host routing. This would allow GKE's iptables DNAT rule (169.254.169.254 → 169.254.169.252:988) to work as-is, without needing LRP or any new API fields.

EDIT: Sorry for the confusion. Let me double-check whether --exclude-local-address is actually relevant here.

I don't want to speak for Liyi, but we'd like to still use host routing, and do the redirection via eBPF. I like your suggestion, and this can definitely be used as a stop-gap solution, but the final goal would be to be able to redirect in host-routing mode to these link-local addresses.

@ysksuzuki
Member

@Bigdelle Our idea is to pass only Compute Engine MDS traffic to the kernel stack, while continuing to handle all other regular traffic through BPF host routing. Would that still be difficult? Is the goal to replace the GKE MDS iptables DNAT rule (169.254.169.254:80 -> 169.254.169.252:988) with a BPF-based mechanism?

@Bigdelle
Contributor

@Bigdelle Our idea is to pass only Compute Engine MDS traffic to the kernel stack, while continuing to handle all other regular traffic through BPF host routing. Would that still be difficult? Is the goal to replace the GKE MDS iptables DNAT rule (169.254.169.254:80 -> 169.254.169.252:988) with a BPF-based mechanism?

Yes that's right, the goal would be to replace the DNAT rule with a BPF-based solution if possible. The flag/PR that you linked is very useful for us, but the long-term goal would be to fully transition this redirection to BPF.

@ysksuzuki
Member

@Bigdelle Our idea is to pass only Compute Engine MDS traffic to the kernel stack, while continuing to handle all other regular traffic through BPF host routing. Would that still be difficult? Is the goal to replace the GKE MDS iptables DNAT rule (169.254.169.254:80 -> 169.254.169.252:988) with a BPF-based mechanism?

Yes that's right, the goal would be to replace the DNAT rule with a BPF-based solution if possible. The flag/PR that you linked is very useful for us, but the long-term goal would be to fully transition this redirection to BPF.

Got it, thanks for confirming. In that case, it seems we would need some kind of LRP extension.

That said, I’m against adding redirectBackend.toIP to allow an arbitrary IP address to be specified. As I mentioned earlier, the main issue is that this would largely undermine the purpose of specifying the backend via redirectBackend.localEndpointSelector.

Unless we can verify that the IP specified in toIP actually belongs to the workload selected by localEndpointSelector, this would effectively allow redirection to work as long as any local workload is selected, which does not seem like a good model to me. Even if we limit this to hostNetwork=true Pods, the problem still remains. In practice, one could still make it work by simply selecting any hostNetwork=true Pod via redirectBackend.localEndpointSelector.

My current view is that it would be better if the target IP could be derived from the Pod information of the workload selected by redirectBackend.localEndpointSelector. Since there does not seem to be an appropriate field for that today, perhaps attaching it via an annotation would be a more practical approach.

For example:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gke-mds
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: gke-mds
  template:
    metadata:
      labels:
        app: gke-mds
      annotations:
        cilium.io/redirect-backend-ip: "169.254.169.252"
    spec:
      hostNetwork: true
      containers:
        - name: gke-mds
          image: example/gke-mds:latest
          ports:
            - containerPort: 988
              hostPort: 988

@liyihuang
Contributor Author

liyihuang commented Mar 13, 2026

I started my PTO yesterday and took a long flight and was late for the party.

@ysksuzuki the purpose of this PR from the beginning is to provide a BPF way to redirect the traffic so people can use eBPF host routing. If we rely on the kernel stack, the traffic can just follow GKE's iptables rules.

Unless we can verify that the IP specified in toIP actually belongs to the workload selected by localEndpointSelector, this would effectively allow redirection to work as long as any local workload is selected, which does not seem like a good model to me. Even if we limit this to hostNetwork=true Pods, the problem still remains. In practice, one could still make it work by simply selecting any hostNetwork=true Pod via redirectBackend.localEndpointSelector.

I'm not sure I understand your concern here. If we only allow this override for pods with hostNetwork=true (I think that's the only case where we can't get the right IP address from the k8s API server), we will be able to know whether that pod owns the IP, since the agent is also in the host network namespace. In that view, it's not an arbitrary DNAT but a DNAT to an IP owned by the pod.

Here is the logic in my head:

  1. The agent checks whether localEndpointSelector selects a pod with hostNetwork.
  2. The agent checks that toIP is one of the local node IPs (the agent is also in the hostNetwork).
  3. Only if toIP belongs to the local node IPs do we allow this override.

I think it safely covers the use case and addresses those valid security concerns.

I personally agree with #41645 (comment); I think it's more confusing for people to configure things in different places.

@ysksuzuki
Member

ysksuzuki commented Mar 13, 2026

@liyihuang Thanks for the detailed explanation.

My concern is specifically about the relationship between localEndpointSelector and toIP. LocalNode IPs belong to the node, not to any particular pod; all hostNetwork=true pods share the same set of node IPs. So even with the LocalNode IP check, a user could select an unrelated hostNetwork=true DaemonSet (e.g. kube-proxy, cilium-agent) via localEndpointSelector, specify toIP: "169.254.169.252", and the validation would still pass. There is no way to verify that the selected workload is the one actually listening on toIP.

I think it's more confusing for people to configure things at different places

With the toIP approach, both CLRP and the GKE MDS pod need to be configured with the same IP address (169.254.169.252), so the configuration is already split across both sides. With the annotation approach, that address lives on the pod itself, and CLRP just references it, keeping the source of truth in one place. Am I missing something?

@liyihuang
Contributor Author

My concern is specifically about the relationship between localEndpointSelector and toIP. LocalNode IPs belong to the node, not to any particular pod, all hostNetwork=true pods share the same set of node IPs. So even with the LocalNode IP check, a user could select an unrelated hostNetwork=true DaemonSet (e.g. kube-proxy, cilium-agent) via localEndpointSelector, specify toIP: "169.254.169.252", and the validation would still pass. There is no way to verify that the selected workload is the one actually listening on toIP.

Yes, any of those validations only make things a bit better. My question is: can we just trust the annotation? If we have another DS in the host network namespace using the annotation to steal the traffic for 169.254.169.252, how do we decide who can own this IP? If we want to determine it from the program args, that's difficult: 169.254.169.252 can come from flags through env vars, args, or a ConfigMap, etc.

In my view, there is no perfect answer; we have to trust something and treat it as the source of truth, guarded by k8s RBAC. If we trust the LRP with k8s RBAC, toIP is enough; if we trust the annotation from the DS manifest with k8s RBAC, the annotation is enough.

WDYT?

@ysksuzuki
Member

ysksuzuki commented Mar 16, 2026

My point is not about which source of information, annotation or toIP, is more trustworthy, but rather about where that information should live. My concern with toIP is that it breaks the API semantics of CLRP.

Today, CLRP users only specify the backend via localEndpointSelector, and the actual redirect destination IP is resolved by the LRP controller (cilium-agent) from the Pod's information. With toIP, the CLRP user directly specifies the redirect destination IP, which effectively makes localEndpointSelector meaningless.

If instead the backend Pod's owner declares the IP via an annotation, and the LRP controller (cilium-agent) uses that annotation to resolve the redirect destination IP, then CLRP users continue to specify backends solely through localEndpointSelector, preserving the existing API semantics.

Whether it's an annotation or toIP, a human is hardcoding an IP somewhere, so we have no choice but to trust that input to some extent. That said, I do think minimal validation (e.g. hostNetwork=true, node local IP, etc.) is still necessary.


Labels

area/lrp Impacts Local Redirect Policy. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.


Development

Successfully merging this pull request may close these issues.

CFP: supporting local redirect policy for new gke-metadata-server