Local Redirect Policy for local DNS cache - observing intermittent DNS failures on new nodes if node-local-dns pod has not yet started fully #43080

@mattrawles

Description

Is there an existing issue for this?

  • I have searched the existing issues

Version

equal or higher than v1.18.4 and lower than v1.19.0

What happened?

We are trying to find the root cause of a DNS issue we have observed on EKS 1.33 running Cilium versions later than 1.17.9, i.e. 1.18.0 onwards.

With a CiliumLocalRedirectPolicy enabled for the node-local DNS cache, when a new node is created, DNS queries fail in any pod that starts before node-local-dns is ready.

With Cilium 1.17.9 or lower on the same cluster we do not observe any DNS failures on new nodes.

How can we reproduce the issue?

Our configuration:
EKS 1.33
Cilium 1.18.x (the issue appears in all versions we have tested: 1.18.1, 1.18.4, and 1.19.0 pre-release builds; installed with Helm via Terraform)
node-local-dns 1.26.0
Cilium LocalRedirectPolicy for DNS
Karpenter 1.6.1

We deploy a set of large pods (e.g. 16 CPU each) into a Karpenter node pool; this schedules each pod onto a new node (one pod per node).

The pod itself runs a curl against an internal website hosted outside the Kubernetes cluster, repeating the test every 5 seconds. This exercises the full DNS path: from the pod to node-local-dns, to the CoreDNS deployment, and then outward to the company DNS infrastructure.

We start 20 pods and read the pod logs, looking for HTTP response codes from the curl test.
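For reference, the test workload can be sketched roughly as below. This is an illustrative reconstruction, not our actual manifest: the pod name, image, and target URL are placeholders.

```yaml
# Hypothetical test pod: curls an internal site every 5 seconds and logs the HTTP code.
apiVersion: v1
kind: Pod
metadata:
  name: dns-curl-test
spec:
  containers:
  - name: curl
    image: curlimages/curl:latest
    resources:
      requests:
        cpu: "16"   # large request so Karpenter schedules one pod per (new) node
    command: ["/bin/sh", "-c"]
    args:
    - |
      while true; do
        curl -s -o /dev/null -w "%{http_code}\n" https://internal.example.com/
        sleep 5
      done
```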

We observe with Cilium 1.18.0+ that this test fails 25%–50% of the time in every run.
With Cilium 1.17.9 the test passes every time.

A failed test is any pod that has initial DNS issues; all pods eventually have functioning DNS (the first couple of tests fail, and once node-local-dns is running the tests start to pass). This type of failure causes problems for very elastic workloads that do not handle the absence of DNS at startup.
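As a workload-side mitigation (not a fix for the regression itself), startup code can retry name resolution before proceeding. A minimal Python sketch; the function name and parameters are ours, not from any cited tooling:

```python
import socket
import time

def wait_for_dns(hostname, attempts=10, delay=1.0):
    """Retry DNS resolution until it succeeds or attempts are exhausted."""
    for i in range(attempts):
        try:
            # gethostbyname raises socket.gaierror while DNS is unavailable.
            return socket.gethostbyname(hostname)
        except socket.gaierror:
            if i == attempts - 1:
                raise
            time.sleep(delay)

# Example: localhost resolves locally, so this succeeds on the first attempt.
print(wait_for_dns("localhost"))
```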

For comparison, on an EKS 1.30 cluster with Cilium 1.16.5 the test always passes.

Here is our LRP:

```yaml
apiVersion: cilium.io/v2
kind: CiliumLocalRedirectPolicy
spec:
  redirectBackend:
    localEndpointSelector:
      matchLabels:
        k8s-app: node-local-dns
    toPorts:
    - name: udp-53
      port: "53"
      protocol: UDP
    - name: tcp-53
      port: "53"
      protocol: TCP
  redirectFrontend:
    serviceMatcher:
      namespace: kube-system
      serviceName: kube-dns
  skipRedirectFromBackend: false
```

Alternative test:

On a node configured with node-local-dns and the Cilium LRP above, break the node-local-dns pod (I did this by setting an invalid image in the DaemonSet and then killing the local pod so it gets stuck in an ImagePullBackOff loop).

In Cilium 1.17.10, DNS recovers within a second and you can query new domains without issue.
In Cilium 1.18.4, DNS dies.
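The "break node-local-dns" step can be reproduced with a strategic merge patch along these lines, applied with `kubectl patch daemonset node-local-dns -n kube-system --patch-file patch.yaml` and followed by deleting the local pod. The container name and image tag are assumptions based on the upstream node-local-dns manifests, not taken from our cluster:

```yaml
# Hypothetical patch: swap in a non-existent image tag so the restarted
# pod gets stuck in ImagePullBackOff.
spec:
  template:
    spec:
      containers:
      - name: node-cache
        image: registry.k8s.io/dns/k8s-dns-node-cache:does-not-exist
```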

Cilium Version

All versions of 1.18.x and 1.19.x.

Kernel Version

6.12.40-63.114.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Aug 7 19:30:51 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

Client Version: v1.34.1
Kustomize Version: v5.7.1
Server Version: v1.33.5-eks-3025e55

Regression

1.17.9 (all versions of 1.17.x we tested so far do not have this issue).

Sysdump

No response

Relevant log output

Anything else?

No response

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Labels

area/datapath: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
area/lrp: Impacts Local Redirect Policy.
kind/bug: This is a bug in the Cilium logic.
kind/community-report: This was reported by a user in the Cilium community, e.g. via Slack.
kind/regression: This functionality worked fine before, but was broken in a newer release of Cilium.
needs/triage: This issue requires triaging to establish severity and next steps.
