Local Redirect Policy for local DNS cache - observing intermittent DNS failures on new nodes if node-local-dns pod has not yet started fully #43080

@mattrawles

Description

Is there an existing issue for this?

  • I have searched the existing issues

Version

equal or higher than v1.18.4 and lower than v1.19.0

What happened?

We are trying to find the root cause of a DNS issue we have observed on EKS 1.33 running Cilium versions later than 1.17.9, i.e. 1.18.0 onwards.

With a CiliumLocalRedirectPolicy enabled for the node-local DNS cache, when a new node is created, DNS queries fail in any pod that starts before node-local-dns is ready.

With Cilium 1.17.9 or lower on the same cluster we do not observe any DNS failures on new nodes.

How can we reproduce the issue?

Our configuration:
EKS 1.33
Cilium 1.18.x (the issue appears in all versions we have tested: 1.18.1, 1.18.4, and 1.19.0 pre-release builds; installed with Helm via Terraform)
node-local-dns 1.26.0
Cilium LocalRedirectPolicy for DNS
Karpenter 1.6.1

We deploy a set of large pods (e.g. 16 CPU each) into a Karpenter node pool; this schedules each pod onto a new node (one pod per node).

The pod itself runs a curl against an internal website hosted outside the Kubernetes cluster, repeating the test every 5 seconds. This exercises the full DNS path: from the pod to node-local-dns, to the CoreDNS deployment, and then outward to the company DNS infrastructure.

We start 20 pods and read the pod logs, looking for HTTP response codes from the curl test.
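For reference, the test workload can be sketched roughly as below. This is an illustrative reconstruction, not our actual manifest: the pod name, image, and target URL are placeholders.

```yaml
# Hypothetical test pod: curls an internal site every 5 seconds and logs the HTTP code.
apiVersion: v1
kind: Pod
metadata:
  name: dns-curl-test
spec:
  containers:
  - name: curl
    image: curlimages/curl:latest
    resources:
      requests:
        cpu: "16"   # large request so Karpenter schedules one pod per (new) node
    command: ["/bin/sh", "-c"]
    args:
    - |
      while true; do
        curl -s -o /dev/null -w "%{http_code}\n" https://internal.example.com/
        sleep 5
      done
```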

We observe with Cilium 1.18.0+ that this test fails 25%–50% of the time in every run.
With Cilium 1.17.9 the test passes every time.

A failed test is any pod that has initial DNS issues; all pods eventually have functioning DNS (the first couple of tests fail, and once node-local-dns is running the tests start to pass). This type of failure causes problems for very elastic workloads that do not handle the absence of DNS at startup.
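As a workload-side mitigation (not a fix for the regression itself), startup code can retry name resolution before proceeding. A minimal Python sketch; the function name and parameters are ours, not from any cited tooling:

```python
import socket
import time

def wait_for_dns(hostname, attempts=10, delay=1.0):
    """Retry DNS resolution until it succeeds or attempts are exhausted."""
    for i in range(attempts):
        try:
            # gethostbyname raises socket.gaierror while DNS is unavailable.
            return socket.gethostbyname(hostname)
        except socket.gaierror:
            if i == attempts - 1:
                raise
            time.sleep(delay)

# Example: localhost resolves locally, so this succeeds on the first attempt.
print(wait_for_dns("localhost"))
```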

For comparison, on an EKS 1.30 cluster with Cilium 1.16.5 the test always passes.

Here is our LRP:

```yaml
apiVersion: cilium.io/v2
kind: CiliumLocalRedirectPolicy
spec:
  redirectBackend:
    localEndpointSelector:
      matchLabels:
        k8s-app: node-local-dns
    toPorts:
    - name: udp-53
      port: "53"
      protocol: UDP
    - name: tcp-53
      port: "53"
      protocol: TCP
  redirectFrontend:
    serviceMatcher:
      namespace: kube-system
      serviceName: kube-dns
  skipRedirectFromBackend: false
```

Alternative test:

On a node configured with node-local-dns and the Cilium LRP above, break the node-local-dns pod (I did this by setting an invalid image in the DaemonSet and then killing the local pod so it gets stuck in an ImagePullBackOff loop).

In Cilium 1.17.10, DNS recovers within a second and you can query new domains without issue.
In Cilium 1.18.4, DNS dies.
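The "break node-local-dns" step can be reproduced with a strategic merge patch along these lines, applied with `kubectl patch daemonset node-local-dns -n kube-system --patch-file patch.yaml` and followed by deleting the local pod. The container name and image tag are assumptions based on the upstream node-local-dns manifests, not taken from our cluster:

```yaml
# Hypothetical patch: swap in a non-existent image tag so the restarted
# pod gets stuck in ImagePullBackOff.
spec:
  template:
    spec:
      containers:
      - name: node-cache
        image: registry.k8s.io/dns/k8s-dns-node-cache:does-not-exist
```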

Cilium Version

All versions of 1.18.x and 1.19.x.

Kernel Version

6.12.40-63.114.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Aug 7 19:30:51 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

Client Version: v1.34.1
Kustomize Version: v5.7.1
Server Version: v1.33.5-eks-3025e55

Regression

1.17.9 (all versions of 1.17.x we tested so far do not have this issue).

Sysdump

No response

Relevant log output

Anything else?

No response

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Labels

area/datapath: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
area/lrp: Impacts Local Redirect Policy.
kind/bug: This is a bug in the Cilium logic.
kind/community-report: This was reported by a user in the Cilium community, e.g. via Slack.
kind/regression: This functionality worked fine before, but was broken in a newer release of Cilium.
needs/triage: This issue requires triaging to establish severity and next steps.
