Is there an existing issue for this?
Version
equal or higher than v1.18.4 and lower than v1.19.0
What happened?
We are trying to find the root cause of a DNS issue we have observed in EKS 1.33 running Cilium newer than 1.17.9 (i.e. 1.18.0 onwards).
With a CiliumLocalRedirectPolicy enabled for the node-local DNS cache, when a new node is created, DNS queries fail in any pod that starts before node-local-dns is ready.
In Cilium 1.17.9 or lower on the same cluster, we do not observe any DNS failures on new nodes.
How can we reproduce the issue?
Our configuration:
EKS 1.33
Cilium 1.18.x (seems to impact all versions; we have tested 1.18.1, 1.18.4, and 1.19.0 pre-release versions), installed with Helm via Terraform
node-local-dns 1.26.0
Cilium LocalRedirectPolicy for DNS
Karpenter 1.6.1
We deploy a set of large pods (e.g. 16 CPU) into a Karpenter node pool; this schedules each pod onto a new node (one pod per node).
Each pod runs a curl against an internal website hosted outside the Kubernetes cluster, repeating the test every 5 seconds. This exercises the DNS path from the pod to node-local-dns, to the CoreDNS deployment, and then outward to the company DNS infrastructure.
We start 20 pods and read the pod logs, looking for the HTTP response codes from the curl test.
With Cilium 1.18.0+, we observe that this test fails 25%–50% of the time in every run.
With Cilium 1.17.9, the test passes every time.
A failed test is any pod that has initial DNS issues; all pods eventually have functioning DNS (the first couple of tests fail, and once node-local-dns is running the tests start to pass). This type of failure causes problems for very elastic workloads that do not tolerate the absence of DNS at startup.
For comparison, on an EKS 1.30 cluster with Cilium 1.16.5 the test always passes.
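The probe each pod runs can be sketched roughly as below. This is a minimal sketch, not our exact manifest: the URL is a placeholder for the internal site, and the PASS/FAIL classification is an assumption about how we read the logs (curl prints "000" when it gets no response, which is what DNS failure at startup looks like).

```shell
#!/bin/sh
# Sketch of the per-pod probe. "$URL" is a placeholder; the real internal
# site is environment-specific.
URL="${URL:-https://internal.example.com/healthz}"   # hypothetical URL

# Print only the HTTP status code; a DNS failure yields "000".
probe() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 3 "$1"
}

# Classify a status code the way we read the pod logs.
classify() {
  case "$1" in
    2??|3??) echo PASS ;;
    *)       echo FAIL ;;
  esac
}

# In the pod this runs forever:
#   while true; do echo "$(date -Iseconds) $(classify "$(probe "$URL")")"; sleep 5; done
classify 200   # a working DNS path
classify 000   # the startup failure mode we see on 1.18.x
```

A pod that logs any FAIL lines before settling into PASS counts as one failed test run.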
Here is our LRP:
apiVersion: cilium.io/v2
kind: CiliumLocalRedirectPolicy
spec:
  redirectBackend:
    localEndpointSelector:
      matchLabels:
        k8s-app: node-local-dns
    toPorts:
      - name: udp-53
        port: "53"
        protocol: UDP
      - name: tcp-53
        port: "53"
        protocol: TCP
  redirectFrontend:
    serviceMatcher:
      namespace: kube-system
      serviceName: kube-dns
  skipRedirectFromBackend: false
Alternative test:
On a node configured with node-local-dns and the Cilium LRP above, break the node-local-dns pod (I did this by setting an invalid image in the daemonset and then killing the local pod so it gets stuck in an ImagePullBackOff loop).
In Cilium 1.17.10, DNS recovers within a second and you can query new domains without issue.
In Cilium 1.18.4, DNS dies.
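The steps above can be sketched as the commands below. The names are assumptions (daemonset node-local-dns with container node-cache in kube-system, a test pod dns-test, and a hypothetical node name); adjust to your cluster. The run() wrapper only echoes each command so the sketch is safe to execute as-is; drop the echo to run it for real.

```shell
#!/bin/sh
# Dry-run sketch of the alternative test; names below are assumptions.
NODE="${NODE:-ip-10-0-0-1.ec2.internal}"   # hypothetical node name
run() { echo "+ $*"; }                      # echo-only; remove echo to execute

# 1. Break the cache: point the daemonset at an image that cannot be pulled.
run kubectl -n kube-system set image daemonset/node-local-dns \
    node-cache=registry.invalid/broken:0
# 2. Delete the local pod so it sticks in ImagePullBackOff on $NODE.
run kubectl -n kube-system delete pod -l k8s-app=node-local-dns \
    --field-selector "spec.nodeName=$NODE"
# 3. From a pod on $NODE, query a new domain: on 1.17.x resolution recovers
#    within a second; on 1.18.x it stays broken.
run kubectl exec dns-test -- nslookup example.com
```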
Cilium Version
All versions of 1.18.x and 1.19.x.
Kernel Version
6.12.40-63.114.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Aug 7 19:30:51 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Kubernetes Version
Client Version: v1.34.1
Kustomize Version: v5.7.1
Server Version: v1.33.5-eks-3025e55
Regression
1.17.9 (no version of 1.17.x we have tested so far has this issue).
Sysdump
No response
Relevant log output
Anything else?
No response
Cilium Users Document
Code of Conduct