Skip to content

fix(dnstap): Better error handling (redial & logging) when Dnstap is busy#7619

Merged
yongtang merged 5 commits into
coredns:masterfrom
aliciay64:fix-dnstap
Nov 6, 2025
Merged

fix(dnstap): Better error handling (redial & logging) when Dnstap is busy#7619
yongtang merged 5 commits into
coredns:masterfrom
aliciay64:fix-dnstap

Conversation

@aliciay64

@aliciay64 aliciay64 commented Oct 16, 2025

Copy link
Copy Markdown
Contributor

1. Why is this pull request needed and what does it do?

Bug / Current behaviour

Both behaviours are caused by the same bug, when the internal queue is always full, we re-enter the for loop very frequently and keeps resetting the flush timer, so the timer never expires to enter the flush clause.

  • When Dnstap fails to connect to output socket at startup, it never redials until the internal buffer queue is empty. When CoreDNS is constantly busy, Dnstap can fail to output for hours.
  • Dropped messages are never reported if the queue constantly receives incoming messages, they are only logged when the buffer is empty.

Changes

  • Change Timer and Reset to time.Ticker() so that the flush/error reporting/redial actions are triggered every 1s
  • Return error from write() on missing encoder, so that the plugin redials the socket on every failed messages.

2. Which issues (if any) are related?

None

3. Which documentation changes (if any) need to be made?

Don't think anything needs to be particularly noted on docs.

4. Does this introduce a backward incompatible change or deprecation?

No

@aliciay64 aliciay64 requested a review from yongtang as a code owner October 16, 2025 07:30
Signed-off-by: xyang378 <xyang378@bloomberg.net>

@rdrozhdzh rdrozhdzh left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug / Current behaviour

If Dnstap fails to connect to output socket at startup, it never redials until the internal buffer queue is empty. When CoreDNS is constantly busy, Dnstap can fail to output for hours.
Dropped messages are never reported if the queue constantly receives incoming messages, they are only logged when the buffer is empty and the flush timeout is reached.

Why do you think so? Do you have any evidence?
According to the golang specification (https://go.dev/ref/spec#Select_statements)

  1. If one or more of the communications can proceed, a single one that can proceed is chosen via a uniform pseudo-random selection.

Comment thread plugin/dnstap/io.go Outdated
Comment thread plugin/dnstap/io.go Outdated
queue: make(chan *tap.Dnstap, multipleQueue*queueSize),
quit: make(chan struct{}),
flushTimeout: flushTimeout,
errorCheckTimeout: errorCheckTimeout,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you are not going to make it configurable, you don't need this property in dio, just use the constant where needed

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making errorCheckInterval configurable so that we can make unit tests faster.

Comment thread plugin/dnstap/io.go Outdated
Comment thread plugin/dnstap/io.go Outdated
Comment thread plugin/dnstap/io.go Outdated
@aliciay64

Copy link
Copy Markdown
Contributor Author

Bug / Current behaviour

If Dnstap fails to connect to output socket at startup, it never redials until the internal buffer queue is empty. When CoreDNS is constantly busy, Dnstap can fail to output for hours.
Dropped messages are never reported if the queue constantly receives incoming messages, they are only logged when the buffer is empty and the flush timeout is reached.

Why do you think so? Do you have any evidence? According to the golang specification (https://go.dev/ref/spec#Select_statements)

  1. If one or more of the communications can proceed, a single one that can proceed is chosen via a uniform pseudo-random selection.

Yes, here are the steps to reproduce the issue.

Set up:

  1. Since this issue only happens under high volume of queries filling up the internal buffer, reduce the queue size for demo purposes: change queueSize = 10 at line 14 plugin/dnstap/io.go
  2. Build coreDNS, run ./coreDNS with the following Corefile that enables dnstap.
.:53 {
    dnstap /tmp/dnstap.sock full
    whoami
}
  1. A simple script dig_loop.sh that sends DNS queries to coreDNS
#!/bin/bash
trap "echo 'Interrupted by user'; exit 0" SIGINT # Trap SIGINT (Ctrl+C)
while true; do
      dig @127.0.0.1 -p 53 +short www.example.com
done
  1. Configure some dnstap listener to listen on /tmp/dnstap.sock, in this case I use dns-collector

Reproduce

  1. Start coredns, then start dig_loop.sh in another terminal, coredns will print something like [ERROR] plugin/dnstap: No connection to dnstap endpoint: dial unix /tmp/dnstap.sock: connect: no such file or directory .:53
  2. coreDNS's dnstap should report dropped messages every flushTimeout, which is 1s, but observe no log is printed.
  3. After some long time, stop the dig_loop.sh script, and can see coreDNS logs all dropped messages from beginning to now.
  4. Start the dig_loop.sh script again, then start dns-collector to listen on the socket. At this point dnstap should've tried to reconnect to the output socket, but it doesn't.
  5. After some time, stop the dig_loop.sh script, then on the dns-collector side, we can see dnstap is connected. And if we send more dig to coreDNS again, we see now dnstap is connected to output.

@aliciay64

Copy link
Copy Markdown
Contributor Author

Bug / Current behaviour

If Dnstap fails to connect to output socket at startup, it never redials until the internal buffer queue is empty. When CoreDNS is constantly busy, Dnstap can fail to output for hours.
Dropped messages are never reported if the queue constantly receives incoming messages, they are only logged when the buffer is empty and the flush timeout is reached.

Why do you think so? Do you have any evidence? According to the golang specification (https://go.dev/ref/spec#Select_statements)

  1. If one or more of the communications can proceed, a single one that can proceed is chosen via a uniform pseudo-random selection.

Yes you are right about the randomness of the select statement, thanks for pointing that out.
That let me realize the bug was not in the select, but in timer.Reset(). When the queue is busy, we re-enter the for loop very frequently and keeps resetting the flush timer, so the timer never expires to enter the flush clause.

The fix is now much simpler, just change to use Ticker. I'll also update the description for this PR.

Signed-off-by: xyang378 <xyang378@bloomberg.net>
Comment thread plugin/dnstap/io.go Outdated
if d.enc == nil {
atomic.AddUint32(&d.dropped, 1)
return nil
return noOutputError

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If dnstap server is not available (e.g. misconfigured), returning the error here will result in frequent attempts to reconnect (on each DNSTAP message), whereas the original code implied one re-connection attempt per flush interval.
Not sure if the new behavior would be better.

@aliciay64 aliciay64 Oct 23, 2025

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair, this redial frequency might be too aggressive.
Hmm but the original behaviour just ignores the no encoder error, there should at least be some error logging. Now that I think about it, it would still be cleaner to decouple the flush logic from error handling logic like in my first commit, even though it's not needed for changing the behaviour, it's better for future maintenance

Signed-off-by: xyang378 <xyang378@bloomberg.net>
Comment thread plugin/dnstap/io.go Outdated
@codecov

codecov Bot commented Oct 24, 2025

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 63.29%. Comparing base (93c57b6) to head (04be7f7).
⚠️ Report is 1725 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7619      +/-   ##
==========================================
+ Coverage   55.70%   63.29%   +7.59%     
==========================================
  Files         224      278      +54     
  Lines       10016    15132    +5116     
==========================================
+ Hits         5579     9578    +3999     
- Misses       3978     4871     +893     
- Partials      459      683     +224     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Signed-off-by: xyang378 <xyang378@bloomberg.net>

CR comment
@aliciay64

Copy link
Copy Markdown
Contributor Author

@rdrozhdzh @yongtang Hi, I have updated the PR with the code review changes, are there other things we need to do before merging this?

Signed-off-by: xyang378 <xyang378@bloomberg.net>
@yongtang yongtang merged commit 59afd4b into coredns:master Nov 6, 2025
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants