fix(dnstap): Better error handling (redial & logging) when Dnstap is busy by aliciay64 · Pull Request #7619 · coredns/coredns

aliciay64 · 2025-10-16T07:30:07Z

1. Why is this pull request needed and what does it do?

Bug / Current behaviour

Both behaviours below are caused by the same bug: when the internal queue is constantly busy, we re-enter the for loop very frequently and keep resetting the flush timer, so the timer never expires and the flush clause is never entered.

When Dnstap fails to connect to the output socket at startup, it never redials until the internal buffer queue is empty. When CoreDNS is constantly busy, Dnstap can fail to output for hours.
Dropped messages are never reported if the queue constantly receives incoming messages; they are only logged when the buffer is empty.

Changes

Replace the time.Timer and its Reset() calls with a time.Ticker so that the flush, error reporting, and redial actions are triggered every 1s.
Return an error from write() when the encoder is missing, so that the plugin redials the socket on every failed message.

2. Which issues (if any) are related?

None

3. Which documentation changes (if any) need to be made?

I don't think anything in particular needs to be noted in the docs.

4. Does this introduce a backward incompatible change or deprecation?

No

Signed-off-by: xyang378 <xyang378@bloomberg.net>

rdrozhdzh

Bug / Current behaviour

If Dnstap fails to connect to output socket at startup, it never redials until the internal buffer queue is empty. When CoreDNS is constantly busy, Dnstap can fail to output for hours.
Dropped messages are never reported if the queue constantly receives incoming messages, they are only logged when the buffer is empty and the flush timeout is reached.

Why do you think so? Do you have any evidence?
According to the Go specification (https://go.dev/ref/spec#Select_statements):

If one or more of the communications can proceed, a single one that can proceed is chosen via a uniform pseudo-random selection.
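
The quoted rule can be checked with a small sketch. The function name countSelections is made up for this example; it keeps two channels ready for the whole run and counts which case select picks.

```go
package main

// countSelections runs select over two channels that are both ready on
// every iteration and reports how often each case was chosen.
func countSelections(rounds int) (a, b int) {
	chA := make(chan struct{}, rounds)
	chB := make(chan struct{}, rounds)
	// Pre-fill both channels so neither can run dry within rounds receives.
	for i := 0; i < rounds; i++ {
		chA <- struct{}{}
		chB <- struct{}{}
	}
	for i := 0; i < rounds; i++ {
		select {
		case <-chA:
			a++
		case <-chB:
			b++
		}
	}
	return a, b
}
```

Over many rounds both counters should end up close to half of the total, which is why a ready timer case cannot be starved by select itself.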

rdrozhdzh · 2025-10-16T12:04:55Z

+		queue:             make(chan *tap.Dnstap, multipleQueue*queueSize),
+		quit:              make(chan struct{}),
+		flushTimeout:      flushTimeout,
+		errorCheckTimeout: errorCheckTimeout,


If you are not going to make it configurable, you don't need this property in dio; just use the constant where needed.

I'm making errorCheckInterval configurable so that the unit tests can run faster.

aliciay64 · 2025-10-22T14:40:38Z

Bug / Current behaviour
If Dnstap fails to connect to output socket at startup, it never redials until the internal buffer queue is empty. When CoreDNS is constantly busy, Dnstap can fail to output for hours.
Dropped messages are never reported if the queue constantly receives incoming messages, they are only logged when the buffer is empty and the flush timeout is reached.
Why do you think so? Do you have any evidence? According to the golang specification (https://go.dev/ref/spec#Select_statements)

If one or more of the communications can proceed, a single one that can proceed is chosen via a uniform pseudo-random selection.

Yes, here are the steps to reproduce the issue.

Set up:

Since this issue only happens when a high volume of queries fills up the internal buffer, reduce the queue size for demo purposes: set queueSize = 10 at line 14 of plugin/dnstap/io.go.
Build CoreDNS and run ./coreDNS with the following Corefile, which enables dnstap.

.:53 {
    dnstap /tmp/dnstap.sock full
    whoami
}

Create a simple script, dig_loop.sh, that sends DNS queries to CoreDNS in a loop:

#!/bin/bash
trap "echo 'Interrupted by user'; exit 0" SIGINT # Trap SIGINT (Ctrl+C)
while true; do
      dig @127.0.0.1 -p 53 +short www.example.com
done

Configure a dnstap listener to listen on /tmp/dnstap.sock. In this case I use dns-collector.

Reproduce

Start CoreDNS, then start dig_loop.sh in another terminal. CoreDNS prints something like [ERROR] plugin/dnstap: No connection to dnstap endpoint: dial unix /tmp/dnstap.sock: connect: no such file or directory .:53
CoreDNS's dnstap plugin should report dropped messages every flushTimeout, which is 1s, but no log is printed.
After a long time, stop the dig_loop.sh script. CoreDNS then logs all the messages dropped since the start.
Start the dig_loop.sh script again, then start dns-collector to listen on the socket. At this point dnstap should have tried to reconnect to the output socket, but it doesn't.
After some time, stop the dig_loop.sh script. On the dns-collector side, we can now see that dnstap is connected, and if we send more dig queries to CoreDNS, we see that dnstap is now connected to the output.

aliciay64 · 2025-10-22T16:08:24Z

Bug / Current behaviour
If Dnstap fails to connect to output socket at startup, it never redials until the internal buffer queue is empty. When CoreDNS is constantly busy, Dnstap can fail to output for hours.
Dropped messages are never reported if the queue constantly receives incoming messages, they are only logged when the buffer is empty and the flush timeout is reached.
Why do you think so? Do you have any evidence? According to the golang specification (https://go.dev/ref/spec#Select_statements)

If one or more of the communications can proceed, a single one that can proceed is chosen via a uniform pseudo-random selection.

Yes, you are right about the randomness of the select statement, thanks for pointing that out.
That made me realize the bug was not in the select, but in timer.Reset(). When the queue is busy, we re-enter the for loop very frequently and keep resetting the flush timer, so the timer never expires and the flush clause is never entered.

The fix is now much simpler: just switch to a Ticker. I'll also update the description for this PR.


rdrozhdzh · 2025-10-22T16:44:21Z

 	if d.enc == nil {
-		atomic.AddUint32(&d.dropped, 1)
-		return nil
+		return noOutputError


If the dnstap server is not available (e.g. misconfigured), returning the error here will result in frequent attempts to reconnect (one per DNSTAP message), whereas the original code implied one reconnection attempt per flush interval.
I'm not sure the new behavior would be better.

That's fair, this redial frequency might be too aggressive.
Hmm, but the original behaviour just ignores the missing-encoder error; there should at least be some error logging. Now that I think about it, it would still be cleaner to decouple the flush logic from the error handling logic, as in my first commit. Even though it's not needed to change the behaviour, it's better for future maintenance.


codecov · 2025-10-24T16:55:32Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 63.29%. Comparing base (93c57b6) to head (04be7f7).
⚠️ Report is 1725 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #7619      +/-   ##
==========================================
+ Coverage   55.70%   63.29%   +7.59%     
==========================================
  Files         224      278      +54     
  Lines       10016    15132    +5116     
==========================================
+ Hits         5579     9578    +3999     
- Misses       3978     4871     +893     
- Partials      459      683     +224


aliciay64 · 2025-10-31T11:22:47Z

@rdrozhdzh @yongtang Hi, I have updated the PR with the code review changes. Is there anything else we need to do before merging this?


aliciay64 requested a review from yongtang as a code owner October 16, 2025 07:30

Fix dnstap redial & improve logging

4ae3452

Signed-off-by: xyang378 <xyang378@bloomberg.net>

aliciay64 force-pushed the fix-dnstap branch from 1f705ff to 4ae3452 Compare October 16, 2025 08:30

rdrozhdzh reviewed Oct 16, 2025


aliciay64 force-pushed the fix-dnstap branch from 1865877 to af0f10e Compare October 22, 2025 16:01

fix CR comments

f813995

Signed-off-by: xyang378 <xyang378@bloomberg.net>

aliciay64 force-pushed the fix-dnstap branch from af0f10e to f813995 Compare October 22, 2025 16:12

rdrozhdzh reviewed Oct 22, 2025


redial at interval

16a6972

Signed-off-by: xyang378 <xyang378@bloomberg.net>

rdrozhdzh reviewed Oct 24, 2025


Comment thread plugin/dnstap/io.go Outdated

aliciay64 force-pushed the fix-dnstap branch from e5841e5 to 9a5e5cc Compare October 28, 2025 14:18

CR comments & lint

03bfb85

Signed-off-by: xyang378 <xyang378@bloomberg.net> CR comment

aliciay64 force-pushed the fix-dnstap branch from 9a5e5cc to 03bfb85 Compare October 28, 2025 14:20

fix lint

04be7f7

Signed-off-by: xyang378 <xyang378@bloomberg.net>

yongtang approved these changes Nov 4, 2025


yongtang merged commit 59afd4b into coredns:master Nov 6, 2025
13 checks passed

BrewTestBot mentioned this pull request Dec 10, 2025

coredns 1.13.2 Homebrew/homebrew-core#257995

Merged

yongtang merged 5 commits into coredns:master from aliciay64:fix-dnstap
