fix: improve default readiness probe config for shutdown #204
hagaibarel merged 1 commit into coredns:master
Conversation
On shutdown, coredns waits for the configured lameduck duration and continues handling connections; at the same time it fails readiness probes. To give readiness checks a chance to remove the instance from the service, we need to lower the failure threshold and the probe interval. This avoids failing DNS requests in a busy cluster when coredns is being scaled down. Signed-off-by: Heiko Voigt <heiko.voigt@jimdo.com>
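To see why both values matter, here is a rough model of the worst-case delay between a pod starting to fail its readiness probe and the kubelet marking it unready (an illustrative sketch only; it ignores probe timeouts and the extra propagation delay through the endpoint controller and kube-proxy):

```python
def worst_case_unready_seconds(period_seconds: int, failure_threshold: int) -> int:
    """Upper bound: up to one full period before the first failing probe
    fires, then failure_threshold consecutive failures, one per period."""
    return period_seconds + period_seconds * failure_threshold

# Previous defaults: periodSeconds=10, failureThreshold=3 -> up to 40s,
# far longer than a 5s lameduck window.
old = worst_case_unready_seconds(10, 3)

# Tightened probe: periodSeconds=5, failureThreshold=3 -> up to 20s.
new = worst_case_unready_seconds(5, 3)
print(old, new)
```

The exact bound depends on kubelet's probe scheduling, but the qualitative point holds: with the old defaults, lameduck can expire long before the pod is ever marked unready.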
Thanks a lot for the effort!
Adopt changes from coredns/helm#204
@hvoigt I'm not convinced about this optimisation. I did similar scale-in tests from 20 to 2 coredns pods, and what I found is the following:
```yaml
readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /ready
    port: 8181
    scheme: HTTP
  periodSeconds: 5
  successThreshold: 1
  timeoutSeconds: 1
```

Using the above, I am aware that this comes with a downside of longer rollout and scale-in time for coredns pods, but I believe this makes more sense from an operational perspective.

Some background for this approach:

```
$ kubectl get endpointslices.discovery.k8s.io -n kube-system -l=k8s-app=kube-dns -o yaml
```
```yaml
apiVersion: v1
items:
- addressType: IPv4
  apiVersion: discovery.k8s.io/v1
  endpoints:
  - addresses:
    - <redacted>
    conditions:
      ready: false
      serving: true
      terminating: true
...
```
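The conditions in that output are what make graceful scale-in possible: during lameduck the endpoint is no longer `ready`, but it is still `serving` while `terminating`. A simplified model of how endpoints could be selected from those conditions (an illustrative sketch; real kube-proxy behaviour, including its fallback to serving-while-terminating endpoints, is more involved):

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    address: str
    ready: bool
    serving: bool
    terminating: bool


def eligible_endpoints(endpoints):
    """Simplified model: prefer ready endpoints; if none are ready, fall
    back to endpoints still serving while terminating (similar in spirit
    to kube-proxy's terminating-endpoint fallback)."""
    ready = [e for e in endpoints if e.ready]
    if ready:
        return ready
    return [e for e in endpoints if e.serving and e.terminating]


eps = [
    Endpoint("10.0.0.1", ready=False, serving=True, terminating=True),
    Endpoint("10.0.0.2", ready=True, serving=True, terminating=False),
]
# While a ready endpoint exists, the terminating one receives no traffic.
print([e.address for e in eligible_endpoints(eps)])
```

Once the last ready endpoint disappears, the model (like the real fallback) can still route to the terminating-but-serving pod during its lameduck window.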
Adopt changes from coredns/helm#204 Signed-off-by: Arnaud Meukam <ameukam@gmail.com>

On shutdown, coredns waits for the configured lameduck duration and continues handling connections; at the same time it fails readiness probes.
To give readiness checks a chance to remove the instance from the service, we need to lower the failure threshold and the probe interval.
This avoids failing DNS requests in a busy cluster when coredns is being scaled down.
Why is this pull request needed and what does it do?
We used a dns-test-container and did the following procedure to get this result:
We configured coredns with the EKS defaults: `lameduck 5s` and `readinessProbe: periodSeconds=10s, failureThreshold=3`. This test was executed in a cluster running ~1000 pods.
With this change in place we observed no failed DNS resolutions when repeating this test.
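The dns-test-container itself is not part of this PR; a minimal sketch of such a resolution loop (hypothetical — the target name, attempt count, and failure handling here are assumptions, not the actual test tool) could look like:

```python
import socket
import time


def dns_loss_test(name: str, attempts: int = 100, delay: float = 0.0) -> int:
    """Resolve `name` repeatedly and count failures. Run this in a pod
    while scaling coredns down to see whether any lookups are lost."""
    failures = 0
    for _ in range(attempts):
        try:
            socket.getaddrinfo(name, None)
        except socket.gaierror:
            failures += 1
        if delay:
            time.sleep(delay)
    return failures


# Against a healthy resolver the expected failure count is zero.
print(dns_loss_test("localhost", attempts=10))
```

In a real run you would point this at an in-cluster service name and add a small delay between attempts so the loop spans the whole scale-in.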
Which issues (if any) are related?
The issue is described above.
Checklist:
Changes are automatically published when merged to main. They are not published on branches.

Note on DCO
If the DCO action in the integration test fails, one or more of your commits are not signed off. Please click on the Details link next to the DCO action for instructions on how to resolve this.