Consul not returning NXDOMAIN for DNS queries to a non-existent datacenter#8218
Consul not returning NXDOMAIN for DNS queries to a non-existent datacenter#8218mkeeler merged 3 commits intohashicorp:masterfrom
Conversation
pierresouchay
left a comment
There was a problem hiding this comment.
Nice finding, yes, did not expected RPC not going to a server
mkeeler
left a comment
There was a problem hiding this comment.
Thanks for the PR, this was a good find and an excellent PR.
|
🍒✅ Cherry pick of commit e0f9e4a onto |
|
Glad to help, folks! 🎩 |
|
btw, using errors chains by wrapping it with |
This would involve refactoring of error handling across the whole project (Consul is from the pre pkg/errors age). That was not the aim of this PR, however. Surely can be done as a follow up. |
sure, it's not related to the PR. I was bringing attention of maintainers. |
Overview
Recently we found an issue when a DNS query sent to Consul for a non-existent domain makes client agent to keep querying servers indefinitely for that domain, when DNS cache is enabled, i.e.
dns_config.use_cacheis set totrue. We were able to reproduce the issue locally with the most recent versions (1.7.4, 1.8.0):In addition to be an annoying bug, this is also a DDoS vector allowing to quickly exhaust memory on all Consul servers in the cluster by flooding a single Consul agent with random queries in non-existent domains. Once DNS cache is disabled, the agent sends only a single query to the servers available.
All of this is happening due to the bug where
SERVFAILis returned whenNXDOMAINshould be. The bug was supposed to be fixed in #8103, but unfortunately it's not for the case where a DNS query goes through a non-server agent first:Configuration used to reproduce the issue: https://gist.github.com/yurkeen/7bc4bc12c7551b8c979cbf01a3cb5bea
Reasons
There are two consequent reasons why this issue is happening:
I. The test provided in #8103,
TestDNS_NonExistingDCchecks the DNS response received from the agent, when the agent is a Consul server itself. However, this test does not cover the case when DNS resolution happens over RPC through client agent reaching a server agent.II. The error being provided for the
computeRCode()method in dns.go is compared to the"No path to datacenter"string. In case when resolution happens through a Consul agent, the error looks slightly different:"rpc error making call: No path to datacenter", i.e. RPC layer adds more context to this error causing the condition to fail and slip through to returning the defaultdns.RcodeServerFailure.Solution
This PR is split into three commits:
3b4ddaaadds a test to make sure NXDOMAIN is returned when Consul server agent is queried over RPC by a non-server agent. If ran without applying the next commit8d18422, the test will fail.8d18422adds a fix to compare errors properly, so NXDOMAIN is returned for both cases — direct on the server agent and over RPC;10361ddis a bonus track adding small refactoring of theErrQueryNotFounderror and adding theIsErrQueryNotFound(err)helper for it to align it with the rest of the errors.