Stop re-using processors defined in the config#34761
Conversation
After introducing the `SafeProcessor` wrapper in elastic#34647 we started returning errors when a processor is being used after its `Close` function has been called. This led to dropped events and error spam in logs but also confirmed that the root cause of the problem was not just a race condition on `Close` but re-used processors somewhere. After a long investigation such code that's re-using processors was finally found. This is the change that removes re-using the processors and instantiates them on each input restart.
|
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane) |
|
Looks like the bug was introduced in #17655 (Apr, 2020) |
cmacknz
left a comment
There was a problem hiding this comment.
Thanks for tracking this down!
faec
left a comment
There was a problem hiding this comment.
Good find!
One concern: this is marked as closing #34716, but that issue isn't really about processor reuse, but about the fact that when we detect processor reuse we go into a deadlock. I believe this PR addresses at least one cause of processor reuse, but if there are others we don't know about, the deadlock itself would still be there. This fix looks good but I'd rather leave the linked issue open until we're sure that SafeProcessor doesn't deadlock when it is triggered.
Should we downgrade the // Run allows to run processor only when `Close` was not called prior
func (p *SafeProcessor) Run(event *beat.Event) (*beat.Event, error) {
if atomic.LoadUint32(&p.closed) == 1 {
return nil, ErrClosed
} |
|
Actually, I'm not sure using a log here will help. Process failure already result in just a log failure: beats/libbeat/publisher/pipeline/client.go Lines 97 to 103 in 331f792 The problem is that this log is for every event so it might have been overwhelming both the beat and the the agent log collection. Potentially the monitoring filebeat is receiving this log line for every event ingested by agent. |
|
/test x-pack/libbeat-goIntegTest |
|
My preference is to merge this and auto-close the issue. Then we can reopen the issue if we find another instance of this problem. We have no further action to take after merging this PR besides testing and observation. I think reporting this error will be hard, because using a log line will potentially log once per event unless we add a rate limit for that log line or switch to using something more performant but harder to notice like a metrics counter. |
* Stop re-using processors defined in the config After introducing the `SafeProcessor` wrapper in #34647 we started returning errors when a processor is being used after its `Close` function has been called. This led to dropped events and error spam in logs but also confirmed that the root cause of the problem was not just a race condition on `Close` but re-used processors somewhere. After a long investigation such code that's re-using processors was finally found. This is the change that removes re-using the processors and instantiates them on each input restart. * Fix linter issues * Add changelog entry (cherry picked from commit 5cfe62c) # Conflicts: # filebeat/channel/runner.go # libbeat/processors/safe_processor.go
* Stop re-using processors defined in the config After introducing the `SafeProcessor` wrapper in #34647 we started returning errors when a processor is being used after its `Close` function has been called. This led to dropped events and error spam in logs but also confirmed that the root cause of the problem was not just a race condition on `Close` but re-used processors somewhere. After a long investigation such code that's re-using processors was finally found. This is the change that removes re-using the processors and instantiates them on each input restart. * Fix linter issues * Add changelog entry (cherry picked from commit 5cfe62c) Co-authored-by: Denis <denis.rechkunov@elastic.co>
* Stop re-using processors defined in the config After introducing the `SafeProcessor` wrapper in #34647 we started returning errors when a processor is being used after its `Close` function has been called. This led to dropped events and error spam in logs but also confirmed that the root cause of the problem was not just a race condition on `Close` but re-used processors somewhere. After a long investigation such code that's re-using processors was finally found. This is the change that removes re-using the processors and instantiates them on each input restart. * Fix linter issues * Add changelog entry (cherry picked from commit 5cfe62c) Co-authored-by: Denis <denis.rechkunov@elastic.co>
|
@faec this is marked as closing #34716 because it's fixing the error reported there. If you manage to reproduce this error again please feel free to re-open it. Could you elaborate more about that deadlock behaviour you're talking about? In my testing I didn't observe any deadlock in Filebeat. And the most important: let's not underestimate the danger of using closed processors. Some of the processors are using background tasks to fetch data like We cannot use closed processors. We should rather produce a fatal error than let customer's data to be corrupted. I doubt there are more occurrences of processors re-use, I checked a lot of places but if there are more – we can easily identify them now thanks to |
#34764) * Stop re-using processors defined in the config (#34761) * Stop re-using processors defined in the config After introducing the `SafeProcessor` wrapper in #34647 we started returning errors when a processor is being used after its `Close` function has been called. This led to dropped events and error spam in logs but also confirmed that the root cause of the problem was not just a race condition on `Close` but re-used processors somewhere. After a long investigation such code that's re-using processors was finally found. This is the change that removes re-using the processors and instantiates them on each input restart. * Fix linter issues * Add changelog entry (cherry picked from commit 5cfe62c) # Conflicts: # filebeat/channel/runner.go # libbeat/processors/safe_processor.go * Resolve conflicts, fix changelog * Add new line to changelog * Revert comment auto-formatting --------- Co-authored-by: Denis <denis.rechkunov@elastic.co>
It's a follow-up to elastic#34761 This test makes sure that none of the critical configuration fields are re-used between instances of the pipeline client.
It's a follow-up to #34761 This test makes sure that none of the critical configuration fields are re-used between instances of the pipeline client.
* Add test for the processor re-use issue (#34870) It's a follow-up to #34761 This test makes sure that none of the critical configuration fields are re-used between instances of the pipeline client. (cherry picked from commit 3d917c8) # Conflicts: # filebeat/channel/runner.go * Resolve conflicts --------- Co-authored-by: Denis <denis.rechkunov@elastic.co>
* Add test for the processor re-use issue (#34870) It's a follow-up to #34761 This test makes sure that none of the critical configuration fields are re-used between instances of the pipeline client. (cherry picked from commit 3d917c8) # Conflicts: # filebeat/channel/runner.go # filebeat/channel/runner_test.go * Resolve conflicts --------- Co-authored-by: Denis <denis.rechkunov@elastic.co>
|
Is there any expected release date? When using elastic for alerting, it generates a lot of false positives. |
|
@icc-garciaju not sure what alerting has to do with this issue but this fix has been released with 8.7.0 and was backported to the next 7.17. |
|
Is it included on 8.7.0 of the agent? Is it being backported to 8.6? I was told by elastic support it would be on the next release. |
yes.
It was but I'm not sure if there was another 8.6.x release since then. You can always see the backports in the labels of our PRs.
If the issue you had was about having these errors in logs: Then it should be fixed in 8.7.0. |
* Stop re-using processors defined in the config After introducing the `SafeProcessor` wrapper in #34647 we started returning errors when a processor is being used after its `Close` function has been called. This led to dropped events and error spam in logs but also confirmed that the root cause of the problem was not just a race condition on `Close` but re-used processors somewhere. After a long investigation such code that's re-using processors was finally found. This is the change that removes re-using the processors and instantiates them on each input restart. * Fix linter issues * Add changelog entry
It's a follow-up to #34761 This test makes sure that none of the critical configuration fields are re-used between instances of the pipeline client.
What does this PR do?
After introducing the
SafeProcessorwrapper in#34647 we started returning errors when a processor is being used after its
Closefunction has been called.This led to dropped events and error spam in logs but also confirmed that the root cause of the problem was not just a race condition on
Closebut re-used processors somewhere.After a long investigation such code that's re-using processors was finally found.
This is the change that removes re-using the processors and instantiates them on each input restart.
Looks like the bug was introduced in #17655 (Apr, 2020)
Why is it important?
Fixes dropped events, panics and error spam in logs.
Checklist
- [] My code follows the style guidelines of this project- [ ] I have commented my code, particularly in hard-to-understand areas- [ ] I have made corresponding changes to the documentation- [ ] I have made corresponding change to the default configuration files- [ ] I have added tests that prove my fix is effective or that my feature worksCHANGELOG.next.asciidocorCHANGELOG-developer.next.asciidoc.How to test this PR locally
elastic-package stack upelastic-agentwith this policy:{ "log.level": "error", "@timestamp": "2023-03-02T11:59:42.394Z", "message": "Failed to publish event: attempt to use a closed processor", "component": { "binary": "filebeat", "dataset": "elastic_agent.filebeat", "id": "filestream-monitoring", "type": "filestream" }, "log": { "source": "filestream-monitoring" }, "log.logger": "publisher", "log.origin": { "file.line": 102, "file.name": "pipeline/client.go" }, "service.name": "filebeat", "ecs.version": "1.6.0" }After this change you should not see these errors anymore.
Related issues
SafeProcessorwrapper #34647