fix(dnstap): Better error handling (redial & logging) when Dnstap is busy#7619
Conversation
Signed-off-by: xyang378 <xyang378@bloomberg.net>
rdrozhdzh
left a comment
There was a problem hiding this comment.
Bug / Current behaviour
If Dnstap fails to connect to output socket at startup, it never redials until the internal buffer queue is empty. When CoreDNS is constantly busy, Dnstap can fail to output for hours. Dropped messages are never reported if the queue constantly receives incoming messages, they are only logged when the buffer is empty and the flush timeout is reached.
Why do you think so? Do you have any evidence?
According to the golang specification (https://go.dev/ref/spec#Select_statements)
- If one or more of the communications can proceed, a single one that can proceed is chosen via a uniform pseudo-random selection.
| queue: make(chan *tap.Dnstap, multipleQueue*queueSize), | ||
| quit: make(chan struct{}), | ||
| flushTimeout: flushTimeout, | ||
| errorCheckTimeout: errorCheckTimeout, |
There was a problem hiding this comment.
if you are not going to make it configurable, you don't need this property in dio, just use the constant where needed
There was a problem hiding this comment.
Making errorCheckInterval configurable so that we can make unit tests faster.
Yes, here are the steps to reproduce the issue. Set up:
Reproduce
|
Yes you are right about the randomness of the The fix is now much simpler, just change to use |
Signed-off-by: xyang378 <xyang378@bloomberg.net>
| if d.enc == nil { | ||
| atomic.AddUint32(&d.dropped, 1) | ||
| return nil | ||
| return noOutputError |
There was a problem hiding this comment.
If dnstap server is not available (e.g. misconfigured), returning the error here will result in frequent attempts to reconnect (on each DNSTAP message), whereas the original code implied one re-connection attempt per flush interval.
Not sure if the new behavior would be better.
There was a problem hiding this comment.
That's fair, this redial frequency might be too aggressive.
Hmm but the original behaviour just ignores the no encoder error, there should at least be some error logging. Now that I think about it, it would still be cleaner to decouple the flush logic from error handling logic like in my first commit, even though it's not needed for changing the behaviour, it's better for future maintenance
Signed-off-by: xyang378 <xyang378@bloomberg.net>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #7619 +/- ##
==========================================
+ Coverage 55.70% 63.29% +7.59%
==========================================
Files 224 278 +54
Lines 10016 15132 +5116
==========================================
+ Hits 5579 9578 +3999
- Misses 3978 4871 +893
- Partials 459 683 +224 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Signed-off-by: xyang378 <xyang378@bloomberg.net> CR comment
|
@rdrozhdzh @yongtang Hi, I have updated the PR with the code review changes, are there other things we need to do before merging this? |
1. Why is this pull request needed and what does it do?
Bug / Current behaviour
Both behaviours are caused by the same bug, when the internal queue is always full, we re-enter the for loop very frequently and keeps resetting the flush timer, so the timer never expires to enter the flush clause.
Changes
TimerandResettotime.Ticker()so that the flush/error reporting/redial actions are triggered every 1s2. Which issues (if any) are related?
None
3. Which documentation changes (if any) need to be made?
Don't think anything needs to be particularly noted on docs.
4. Does this introduce a backward incompatible change or deprecation?
No