Add toIP to CiliumLocalRedirectPolicy redirectBackend #41645
liyihuang wants to merge 7 commits into cilium:main
Conversation
Force-pushed from 745c66e to 1c79e79.

/test
Force-pushed from 1c79e79 to 3fc4c9e.

/test
Force-pushed from 3fc4c9e to 44669bf.

/test
Putting back to draft since I think the test failure is not the flaky CI.
qmonnet left a comment:
Thanks! Here are some small doc nits from my side
Force-pushed from 44669bf to 4f9e02a.
Force-pushed from 0c915b3 to 28d6b2a.
I just tested it with the GKE use case with such a config, and with this LRP config we get the following warning rejecting the LRP config. At the end of the day, people need to manage who can access the API or config; otherwise, people can always override the Helm values to access whatever they want. To me this is tighter from the LRP perspective but no different from the cluster perspective, since people can always control what can be accessed via network policy.
Add a new Helm value `localRedirectPolicies.toIPRange` to configure the allowed IP range for the ToIP field in Local Redirect Policies. The configuration maps to the `lrp-to-ip-range` flag in the Cilium ConfigMap. Signed-off-by: Liyi Huang <liyi.huang@isovalent.com>
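As a sketch of how this might look in practice (assuming the value is a single-CIDR string; the range shown is illustrative, not a shipped default):

```yaml
# values.yaml sketch: restrict the IPs that an LRP ToIP field may target.
# The CIDR below is an example (GKE metadata-server link-local address),
# chosen for illustration only.
localRedirectPolicies:
  toIPRange: "169.254.169.252/32"
```

Per the commit message, this value would be rendered into the Cilium ConfigMap as the `lrp-to-ip-range` flag.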
Force-pushed from 28d6b2a to 4f9f5e8.

/test
FYI this topic came up during the community meeting today, so @jrife, @bowei and I briefly discussed it. The recording is here (starts around 26m51s). The core question is whether we want to extend this existing feature in a way that is powerful but increases the likelihood of the feature causing breakage of other features (plus the corresponding support burden), or whether we try to improve the abstraction of traffic handling for this use case, "service with a local cache". We didn't come to any conclusions; this probably needs a bit more ongoing discussion with folks interested in the topic (at least @ysksuzuki, @liyihuang, @Bigdelle also expressed interest).
I’ve watched the recording and here are my thoughts on LRP.

Looking at the LRP use cases in the documentation (https://docs.cilium.io/en/latest/network/kubernetes/local-redirect-policy/), the most common example seems to be local DNS. When I browse the LRP codebase, I also notice that most of the documentation and implementation haven’t been touched in the last ~5 years, aside from the recent control-plane rewrite. This makes me think that local caching is the primary (and perhaps only widely used) use case for LRP, and that LRP itself is not broadly adopted in the same way as core features like NetworkPolicy.

If that’s the reality, then changing LRP to be cluster-wide, or introducing a new API for a cluster-wide LRP, feels like overkill. If we assume that the main users of LRP are platform engineers, then I’d argue that platform engineers already have plenty of “footguns” that can break a cluster; this wouldn’t be fundamentally different.

Personally, I think toIPRange is more than sufficient for the use cases discussed, and it could even be removed. Platform engineers can already use CiliumNetworkPolicy (CNP) to block or control traffic when needed.
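For instance, independent of LRP, a minimal CiliumNetworkPolicy sketch (names and labels are hypothetical, chosen only for illustration) that constrains which pods can reach a given IP:

```yaml
# Sketch: allow the selected pods to reach only the metadata-server address
# on egress, showing that reachability can already be controlled via CNP
# independently of any LRP configuration.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: restrict-metadata-access
  namespace: kube-system
spec:
  endpointSelector:
    matchLabels:
      app: example-workload
  egress:
    - toCIDR:
        - 169.254.169.254/32
```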
I'd like to see some consideration given to what the ideal expression of these node-local caches should look like, and what Cilium's role in that should be.

I understand that a node-local cache should be an opt-in feature; the cluster admin needs to deploy it anyway. If there needs to be some way to express to Cilium that such a cache exists and that Cilium, as the LB, needs to behave differently, then that's fine. That's in essence what LRP does today. LRP with Service matchers and Pod matchers provides a very k8s-native way to implement the functionality we need. The question is what happens when we don't have a convenient handle in a k8s resource that expresses "This node-local cache exists on
AFAIK, we don't have data on how widely LRP is used, so I'm not sure about it. But even if LRP does have a relatively smaller user base compared to core features, I don't think usage alone should be a factor in deciding how strict the design constraints should be.
I agree that platform engineers operate with powerful privileges, but that’s exactly why misconfigurations can have a much larger and more severe impact. Because of that, I think it’s important for APIs like LRP to be designed in a way that minimizes the risk of accidental misconfiguration and provides clear guardrails, rather than assuming that existing “footguns” make additional ones acceptable.

I'd like to clarify the intended semantics of `toIP`. What happens if a user specifies an in-cluster endpoint (for example, a Service ClusterIP or a Pod IP) as `toIP`? Also, how is `toIP` validated?
Thanks for reminding me about that. I overall agree that we shouldn't give the user another footgun just because they already have some. Yes, the current implementation will break what you mentioned there. I think a more proper way to do the validation for the GKE/cache use case is to validate it through Cilium-internal info. Here are my thoughts:
In this way, we limit the scope. @joestringer @ysksuzuki please let me know what you think about it. If we all agree, I can drop my previous manual validation and implement this method.
Do you think we could either infer the right IP, or get the user to explicitly advertise the IP to be used as part of their hostNetwork daemonset, and then skip the

I guess the short answer is that this clarifies which IP to use if the IP is ambiguous, but then I feel it's better to set up a clear contract with the user that they should explicitly declare that IP as metadata in the deployment/daemonset, rather than configuring it in multiple places and trying to make it line up.

On one level this feels like a somewhat trivial question of whether to host the "source of truth" information (which IP) in one resource or two, but on the other hand, if we can make the interfaces clearer and reduce the likelihood of future incompatibilities/footguns by avoiding this extra configuration altogether, then that feels like it might ultimately be simpler to operate and maintain.
I believe it's case-by-case. If users have full control over the Deployment or DaemonSet, we can simply read the configuration. However, in managed Kubernetes environments (like GKE), users often deploy components via a managed console where they don't have direct control over the manifests.

This has recently become an issue because the GKE metadata server started listening on a specific IP address following a recent upgrade. They set up iptables rules for redirection, which breaks things for Cilium users who are utilizing eBPF host routing. If there were a contract between Cilium and other providers (like GKE) to use specific annotations to indicate the preferred IP for these special setups, it would be much easier for both us and the users.

@Bigdelle Is this possible?
Generally, I think this approach makes sense. It sounds like it would keep CLRPs focused on their original intent and solve this GKE MDS issue. We will discuss this internally with the proper folks to see what we can land on. I will keep you updated, but this sounds promising and may work on our end.
I also would like to add some notes to make it clear. With some annotation, we still use the same mechanism to verify the ownership of the IP address, so people can't use it as a DNAT tool to a random IP address. So the following logic is still valid, where we just don't change the frontend IP on the LRP but use some annotation from the pod.
@Bigdelle any update on this?

I'll have an update soon. Thanks for checking in.
@liyihuang
We ideally see just one resource (the CLRP) controlling the behavior of the redirection, rather than definitions split between the data plane and control plane, whether through direct fields in the API or annotations in the CLRP. On a related note, to limit the scope and surface of this feature, would it be possible to limit the
You could probably use BPF socket redirect and lookup helpers to enforce this in the implementation:
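For reference, the shape this PR's proposal implies for the policy itself might look like the following sketch. The `toIP` field is the addition under discussion; all other values (names, addresses, ports) are illustrative assumptions modeled on the GKE MDS case:

```yaml
# Sketch of a CiliumLocalRedirectPolicy with the proposed toIP field.
apiVersion: cilium.io/v2
kind: CiliumLocalRedirectPolicy
metadata:
  name: mds-redirect
  namespace: kube-system
spec:
  redirectFrontend:
    addressMatcher:
      ip: "169.254.169.254"
      toPorts:
        - port: "80"
          protocol: TCP
  redirectBackend:
    localEndpointSelector:
      matchLabels:
        app: gke-mds
    # Proposed addition: redirect to this IP instead of the backend pod IP.
    toIP: "169.254.169.252"
    toPorts:
      - port: "988"
        protocol: TCP
```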
@Bigdelle FYI we're rethinking CLRP also over here: #44138. Might be worth a perusal. I agree that managing multiple resources is a pain. That's why I'm interested to explore whether we even need the LRP abstraction at all, or whether we can conform service/NAT handling closer to the official k8s service objects.
I'm aligned with this just from this PR's perspective, and agree that we should have an easy way for users to consume the API.

I personally think validating on the control plane makes more sense, since that can help us keep the data path simple.
@liyihuang @Bigdelle We discussed this internally and wanted to ask: has using the mechanism from #41275 been considered?

With #41275, excluded addresses fall back to the kernel stack instead of being handled by eBPF host routing. This would allow GKE's iptables DNAT rule to take effect.

EDIT: Sorry for the confusion. Let me double-check whether
I don't want to speak for Liyi, but we'd like to still use host routing, and do the redirection via eBPF. I like your suggestion, and this can definitely be used as a stop-gap solution, but the final goal would be to be able to redirect in host-routing mode to these link-local addresses.
@Bigdelle Our idea is to pass only Compute Engine MDS traffic to the kernel stack, while continuing to handle all other regular traffic through BPF host routing. Would that still be difficult? Is the goal to replace the GKE MDS iptables DNAT rule (169.254.169.254:80 -> 169.254.169.252:988) with a BPF-based mechanism?
Yes that's right, the goal would be to replace the DNAT rule with a BPF-based solution if possible. The flag/PR that you linked is very useful for us, but the long-term goal would be to fully transition this redirection to BPF.
Got it, thanks for confirming. In that case, it seems we would need some kind of LRP extension.

That said, I'm against adding `toIP` unless we can verify that the IP specified in `toIP` is actually owned by the selected backend.

My current view is that it would be better if the target IP could be derived from the Pod information of the workload selected by the backend selector. For example:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gke-mds
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: gke-mds
  template:
    metadata:
      labels:
        app: gke-mds
      annotations:
        cilium.io/redirect-backend-ip: "169.254.169.252"
    spec:
      hostNetwork: true
      containers:
        - name: gke-mds
          image: example/gke-mds:latest
          ports:
            - containerPort: 988
              hostPort: 988
```
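To illustrate how the annotation-based proposal could pair with the policy, here is a sketch of a CiliumLocalRedirectPolicy selecting that DaemonSet's pods. Frontend address and ports are illustrative, and the annotation lookup itself is the proposed behavior under discussion, not something the agent does today:

```yaml
# Sketch: the policy keeps selecting backends by label only; under the
# proposal, the agent would resolve the redirect IP from the pod's
# cilium.io/redirect-backend-ip annotation rather than a toIP field.
apiVersion: cilium.io/v2
kind: CiliumLocalRedirectPolicy
metadata:
  name: gke-mds-redirect
  namespace: kube-system
spec:
  redirectFrontend:
    addressMatcher:
      ip: "169.254.169.254"
      toPorts:
        - port: "80"
          protocol: TCP
  redirectBackend:
    localEndpointSelector:
      matchLabels:
        app: gke-mds
    toPorts:
      - port: "988"
        protocol: TCP
```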
I started my PTO yesterday and took a long flight, so I was late to the party.

@ysksuzuki the purpose of this PR from the beginning is to provide a BPF way to redirect the traffic so people can use eBPF host routing. If we rely on the kernel stack, the traffic will just follow GKE's iptables rules.
I'm not sure I understand your concern here. If we only allow pods to do this override with hostNetwork=true (I think that's the only case where we can't get the right IP address from the k8s API server), we will be able to know whether that pod owns the IP, since the agent is also in the host network namespace. In that view, it's not a random DNAT but a DNAT to an IP owned by the pod. Here is the logic in my head:
I think it safely covers the use case and addresses those valid security concerns. I personally agree with #41645 (comment); I think it's more confusing for people to configure things in different places.
@liyihuang Thanks for the detailed explanation. My concern is specifically about the relationship between `toIP` and the selected backend.
Yes, any such validation only makes things a bit better. My question is: can we just trust the annotation? What if we have another DS in the host network NS using the annotation to steal the traffic?

In my view, there is no perfect answer; we have to trust something and believe that's the source of truth through k8s RBAC. If we trust LRP with k8s RBAC, WDYT?
My point is not about which source of information, annotation or `toIP`, we should trust.

Today, CLRP users only specify the backend via the backend selector. If instead the backend Pod's owner declares the IP via an annotation, and the LRP controller (cilium-agent) uses that annotation to resolve the redirect destination IP, then CLRP users continue to specify backends solely through the selector. Whether it's an annotation or `toIP`,
see commit message
Fixes: #41671