
fix: improve default readiness probe config for shutdown #204

Merged
hagaibarel merged 1 commit into coredns:master from Jimdo:ktlo-graceful-shutdown
Mar 25, 2025

Conversation

@hvoigt (Contributor) commented Mar 24, 2025

On shutdown coredns will wait for the configured lameduck duration and continue handling connections. At the same time it will fail readiness probes.

In order to give readiness checks a chance to remove the instance from the service we need to lower the failure threshold and the probe interval.

This is to avoid failing DNS requests in a busy cluster when coredns is being scaled down.
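A sketch of the kind of probe settings this implies (illustrative only: the `failureThreshold: 1` is mentioned later in this thread, but the `periodSeconds` value here is an assumption, not necessarily what the chart ships):

```yaml
# Illustrative readinessProbe with a lowered interval and failure
# threshold, so the endpoint is dropped well within the 5s lameduck
# window. Values are assumptions, not confirmed chart defaults.
readinessProbe:
  httpGet:
    path: /ready
    port: 8181
    scheme: HTTP
  periodSeconds: 2      # EKS default config uses 10
  failureThreshold: 1   # EKS default config uses 3
  successThreshold: 1
  timeoutSeconds: 1
```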

Why is this pull request needed and what does it do?

We used a dns-test-container and did the following procedure to get this result:

We configured coredns with the EKS defaults: lameduck 5s and a readinessProbe with periodSeconds=10 and failureThreshold=3.

  1. Before the test we scaled coredns to 20 instances manually
  2. We started the tests and waited roughly 10 seconds
  3. We scaled coredns to 1 instance manually
  4. Waited for the test to finish

[Screenshot: DNS losses with the current configuration]

This test was executed in a cluster running ~1000 pods.

With this change in place we get no losses in DNS resolution when repeating this test.

[Screenshot: no DNS losses]

Which issues (if any) are related?

The issue is described above.

Checklist:

  • I have bumped the chart version according to versioning.
  • I have updated the chart changelog with all the changes that come with this pull request according to changelog.
  • Any new values are backwards compatible and/or have sensible default.
  • I have signed off all my commits as required by DCO.



Signed-off-by: Heiko Voigt <heiko.voigt@jimdo.com>
@hagaibarel hagaibarel merged commit ed11181 into coredns:master Mar 25, 2025
2 checks passed
@hagaibarel (Collaborator)

Thanks a lot for the effort!

@youwalther65 commented Jan 5, 2026

@hvoigt I don't feel optimising the readinessProbe just for shutdown with failureThreshold: 1 is the best approach.

I did similar scale-in tests from 20 to 2 coredns pods and what I found is the following:

  • set lameduck slightly larger than the readinessProbe's (failureThreshold * periodSeconds)
  • set readinessProbe to something reasonable like:
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /ready
            port: 8181
            scheme: HTTP
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1

Using the above readinessProbe with a lameduck of 16 or 17s worked perfectly fine and I saw no errors.

[Screenshot: test results, 2026-01-05]

I am aware that this comes with a downside of longer rollout and scale-in time for coredns pods, but I believe this makes more sense from an operational perspective.

Some background for this approach:
lameduck is an attribute of the health plugin, which backs the pod's livenessProbe; its sole purpose regarding scale-in/shutdown is to delay shutdown for DURATION.
The readinessProbe determines when an endpoint is finally removed from the corresponding endpointslices object. When a pod is deleted during scale-in by the replica controller, it gets a termination timestamp, but due to lameduck it is still Running and could potentially still serve traffic, according to the upstream K8s docs:

$ kubectl get endpointslices.discovery.k8s.io -n kube-system -l=k8s-app=kube-dns -o yaml
apiVersion: v1
items:
- addressType: IPv4
  apiVersion: discovery.k8s.io/v1
  endpoints:
  - addresses:
    - <redacted>
    conditions:
      ready: false
      serving: true
      terminating: true
...

Using lameduck > (failureThreshold * periodSeconds) makes sure that the coredns process stays available long enough for the pod to be marked Ready: False, i.e. after 3 unsuccessful readinessProbes.
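In helm-chart terms, this pairing might be sketched roughly as follows (the `servers`/`plugins`/`configBlock` shape follows the coredns chart's values.yaml; treat the exact keys and values as an assumption derived from the discussion above, not as chart defaults):

```yaml
# lameduck (17s) > failureThreshold (3) * periodSeconds (5) = 15s,
# so the process outlives the point at which the endpoint is marked
# ready: false and removed from load balancing.
servers:
  - zones:
      - zone: .
    port: 53
    plugins:
      - name: health
        configBlock: |-
          lameduck 17s
readinessProbe:
  failureThreshold: 3
  periodSeconds: 5
  successThreshold: 1
  timeoutSeconds: 1
```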

ameukam pushed a commit to ameukam/kops that referenced this pull request Feb 25, 2026
Adopt changes from coredns/helm#204
Signed-off-by: Arnaud Meukam <ameukam@gmail.com>
