kube-dns: dnsmasq intermittent connection refused #45976

@someword

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.):

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.):


Is this a BUG REPORT or FEATURE REQUEST? (choose one):

Kubernetes version (use kubectl version):

kubectl version
Client Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.7", GitCommit:"8eb75a5810cba92ccad845ca360cf924f2385881", GitTreeState:"clean", BuildDate:"2017-04-27T10:00:30Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.7", GitCommit:"8eb75a5810cba92ccad845ca360cf924f2385881", GitTreeState:"clean", BuildDate:"2017-04-27T09:42:05Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): PRETTY_NAME="Container Linux by CoreOS 1339.0.0 (Ladybug)"
  • Kernel (e.g. uname -a): 4.10.1-coreos
  • Install tools: custom ansible
  • Others: kube-dns related images: gcr.io/google_containers/kubedns-amd64:1.9 and gcr.io/google_containers/kube-dnsmasq-amd64:1.4.1

What happened:
java.net.UnknownHostException: dynamodb.us-east-1.amazonaws.com

What you expected to happen:
Receive a response to the name lookup request.

How to reproduce it (as minimally and precisely as possible):
This is the kicker: we are not able to reproduce this issue on purpose. However, we see it in our production cluster anywhere from 1 to 500 times a week.

Anything else we need to know:
Over the past 2 months or so we experienced a handful of events where DNS was failing for most or all of our production pods, each lasting 5 - 10 minutes. During these events the kube-dns service was healthy, with 3 - 6 available endpoints at all times. We increased our kube-dns pod count to 20 in our 20-node production clusters. This level of provisioning alleviated the DNS issues that were taking down our production services.

However, we still experience smaller events at least weekly, ranging from 1 second to 30 seconds, which affect a small subset of pods. During these events 1 - 5 pods on different nodes across the cluster experience a burst of DNS failures, with a much smaller end-user impact.

We enabled query logging in dnsmasq because we were not sure whether the queries were making it from the client pod to one of the kube-dns pods. Interestingly, during the DNS events where query logging was enabled, none of the name lookup requests that resulted in an exception were received by dnsmasq. At this point my colleague noticed these errors coming from dnsmasq-metrics:

ERROR: logging before flag.Parse: W0517 03:19:50.139060 1 server.go:53] Error getting metrics from dnsmasq: read udp 127.0.0.1:36181->127.0.0.1:53: i/o timeout

As near as I can tell, that error is a name resolution failure inside dnsmasq-metrics: it queries the dnsmasq container in the same pod to read dnsmasq's internal metrics, similar to running dig +short chaos txt cachesize.bind, and the query timed out.
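To make that kind of probe concrete, here is a minimal sketch of a CHAOS-class TXT query for cachesize.bind sent over UDP with a timeout. It is not the dnsmasq-metrics implementation; the function names, the 1 second timeout, and the query ID are assumptions chosen for illustration.

```python
import socket
import struct

CLASS_CHAOS = 3  # DNS class CH
TYPE_TXT = 16    # DNS record type TXT


def build_chaos_txt_query(name: str, query_id: int = 0x1234) -> bytes:
    """Encode a single-question DNS query for a CHAOS-class TXT record."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is length-prefixed; a zero byte ends the name.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", TYPE_TXT, CLASS_CHAOS)


def probe(host: str = "127.0.0.1", port: int = 53, timeout: float = 1.0) -> bool:
    """Return True if a reply with a matching ID arrives before the timeout."""
    query = build_chaos_txt_query("cachesize.bind")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (host, port))
            reply, _ = sock.recvfrom(512)
        except OSError:  # includes socket timeouts
            return False
        return reply[:2] == query[:2]
```

Calling probe() in a loop from inside the kube-dns pod's network namespace and logging each False result would give timestamps to compare against the client-side UnknownHostException bursts.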

All of our DNS events happen at exactly the same time that one or more dnsmasq-metrics containers are logging those errors. We thought we might be exceeding dnsmasq's default limit of 150 concurrent queries, but we do not see any logs indicating that. If we were, we would expect to see this log message:

dnsmasq: Maximum number of concurrent DNS queries reached (max: 150)
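One way to check both signals against collected pod logs is sketched below. It pulls the timestamp out of each dnsmasq-metrics timeout line and counts occurrences of the concurrent-query limit message. The function name scan is hypothetical, and because the klog-style prefix carries no year, the year is passed in as an assumption.

```python
import re
from datetime import datetime

# klog-style prefix: severity letter, MMDD, then HH:MM:SS.microseconds.
KLOG_TIMEOUT = re.compile(
    r"[IWEF](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6}).*"
    r"Error getting metrics from dnsmasq: .*i/o timeout"
)
MAX_QUERIES = "Maximum number of concurrent DNS queries reached"


def scan(lines, year):
    """Return (timeout_times, limit_hits) found in the given log lines."""
    timeouts, limit_hits = [], 0
    for line in lines:
        match = KLOG_TIMEOUT.search(line)
        if match:
            # The log prefix has no year, so the caller supplies it.
            stamp = f"{year}{match.group(1)} {match.group(2)}"
            timeouts.append(datetime.strptime(stamp, "%Y%m%d %H:%M:%S.%f"))
        elif MAX_QUERIES in line:
            limit_hits += 1
    return timeouts, limit_hits
```

The returned timestamps can then be lined up with application-side exception times, while a limit_hits of zero matches what we observe: timeouts without the limit message.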

Based on conversations with other cluster operators and users in Slack, I know that others are experiencing these same problems. I'm hoping this issue can be used to centralize our efforts and determine whether dnsmasq refusing connections is the problem or a symptom of something else.

Labels:

  • area/dns
  • sig/network: Categorizes an issue or PR as relevant to SIG Network.
  • triage/unresolved: Indicates an issue that can not or will not be resolved.