Stop re-using processors defined in the config by rdner · Pull Request #34761 · elastic/beats

rdner · 2023-03-07T19:01:44Z

What does this PR do?

After introducing the SafeProcessor wrapper in #34647 we started returning errors when a processor is used after its Close function has been called.

This led to dropped events and error spam in logs, but it also confirmed that the root cause of the problem was not just a race condition on Close: processors were being re-used somewhere.

After a long investigation, the code that re-uses processors was finally found.

This change stops re-using the processors and instantiates them on each input restart.
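To make the failure mode concrete, here is a minimal sketch of the difference between handing out one cached processor on every input restart and building a fresh one each time. It is not the code from `filebeat/channel/runner.go`; the types `processor`, `sharedFactory`, `freshFactory` and the helper `restartAndRun` are hypothetical names chosen for illustration.

```go
package main

import "errors"

// errClosed mirrors the "attempt to use a closed processor" failure mode.
var errClosed = errors.New("attempt to use a closed processor")

// processor is a hypothetical stand-in for a configured processor.
type processor struct{ closed bool }

// Run refuses to process events once the processor has been closed.
func (p *processor) Run(event string) (string, error) {
	if p.closed {
		return "", errClosed
	}
	return event + " (processed)", nil
}

// Close marks the processor as unusable.
func (p *processor) Close() { p.closed = true }

// sharedFactory builds the processor once and hands the same instance to
// every input run. This is the re-use pattern described above.
type sharedFactory struct{ cached *processor }

func (f *sharedFactory) Create() *processor {
	if f.cached == nil {
		f.cached = &processor{}
	}
	return f.cached
}

// freshFactory builds a new processor for every input run.
type freshFactory struct{}

func (freshFactory) Create() *processor { return &processor{} }

// restartAndRun simulates an input restart: the first run closes its
// processor on shutdown, then the restarted run processes one event.
func restartAndRun(create func() *processor) error {
	first := create()
	first.Close()
	second := create()
	_, err := second.Run("event")
	return err
}
```

With `sharedFactory`, the restarted run receives the processor that the previous run already closed, so `restartAndRun` returns `errClosed`; with `freshFactory` it returns `nil`.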

Looks like the bug was introduced in #17655 (Apr, 2020)

Why is it important?

Fixes dropped events, panics and error spam in logs.

Checklist

~~- [ ] My code follows the style guidelines of this project~~
~~- [ ] I have commented my code, particularly in hard-to-understand areas~~
~~- [ ] I have made corresponding changes to the documentation~~
~~- [ ] I have made corresponding change to the default configuration files~~
~~- [ ] I have added tests that prove my fix is effective or that my feature works~~

I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

1. Run `elastic-package stack up`
2. Run `elastic-agent` with this policy:

outputs:
  default:
    type: elasticsearch
    log_level: debug
    enabled: true
    hosts: ["https://127.0.0.1:9200"]
    username: "elastic"
    password: "changeme"
    allow_older_versions: true
    ssl:
      verification_mode: none
    shipper:
      enabled: true

inputs:
  - type: system/metrics
    id: unique-system-metrics-input
    data_stream.namespace: default
    use_output: default
    streams:
      - metricset: cpu
        data_stream.dataset: system.cpu
      - metricset: memory
        data_stream.dataset: system.memory
      - metricset: network
        data_stream.dataset: system.network
      - metricset: filesystem
        data_stream.dataset: system.filesystem

Before this change you'd observe errors like this:

{
  "log.level": "error",
  "@timestamp": "2023-03-02T11:59:42.394Z",
  "message": "Failed to publish event: attempt to use a closed processor",
  "component": {
    "binary": "filebeat",
    "dataset": "elastic_agent.filebeat",
    "id": "filestream-monitoring",
    "type": "filestream"
  },
  "log": {
    "source": "filestream-monitoring"
  },
  "log.logger": "publisher",
  "log.origin": {
    "file.line": 102,
    "file.name": "pipeline/client.go"
  },
  "service.name": "filebeat",
  "ecs.version": "1.6.0"
}

After this change you should not see these errors anymore.

Related issues

elasticmachine · 2023-03-07T19:10:59Z

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

elasticmachine · 2023-03-07T19:16:03Z

💚 Build Succeeded

Build stats

Start Time: 2023-03-07T21:34:02.432+0000
Duration: 31 min 43 sec

Test stats 🧪

Test	Results
Failed	0
Passed	64
Skipped	1
Total	65

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.
/package : Generate the packages and run the E2E tests.
/beats-tester : Run the installation tests with beats-tester.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

rdner · 2023-03-07T20:04:13Z

Looks like the bug was introduced in #17655 (Apr, 2020)

filebeat/channel/runner.go

cmacknz

Thanks for tracking this down!

faec

Good find!

One concern: this is marked as closing #34716, but that issue isn't really about processor reuse, but about the fact that when we detect processor reuse we go into a deadlock. I believe this PR addresses at least one cause of processor reuse, but if there are others we don't know about, the deadlock itself would still be there. This fix looks good but I'd rather leave the linked issue open until we're sure that SafeProcessor doesn't deadlock when it is triggered.

cmacknz · 2023-03-07T21:09:30Z

> This fix looks good but I'd rather leave the linked issue open until we're sure that SafeProcessor doesn't deadlock when it is triggered.

Should we downgrade the ErrClosed returned by SafeProcessor to just a warning log? That would still let us find problems, without having them be stop-the-world errors.

// Run allows to run processor only when `Close` was not called prior
func (p *SafeProcessor) Run(event *beat.Event) (*beat.Event, error) {
	if atomic.LoadUint32(&p.closed) == 1 {
		return nil, ErrClosed
	}
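
The excerpt above is cut off after the guard. As a self-contained illustration of the same idea, the sketch below wraps an inner processor with an atomic closed flag. It is a simplified approximation, not the code in `libbeat/processors/safe_processor.go`: `Event`, `Processor` and `countingProcessor` are hypothetical stand-ins, and returning `nil` from a repeated `Close` is an assumption of this sketch.

```go
package main

import (
	"errors"
	"sync/atomic"
)

// ErrClosed is returned when Run is called after Close.
var ErrClosed = errors.New("attempt to use a closed processor")

// Event is a simplified, hypothetical stand-in for beat.Event.
type Event struct{ Fields map[string]string }

// Processor is a hypothetical, reduced processor interface.
type Processor interface {
	Run(event *Event) (*Event, error)
	Close() error
}

// SafeProcessor guards an inner processor with an atomic closed flag.
type SafeProcessor struct {
	inner  Processor
	closed uint32
}

// Run forwards to the inner processor only while Close has not been called.
func (p *SafeProcessor) Run(event *Event) (*Event, error) {
	if atomic.LoadUint32(&p.closed) == 1 {
		return nil, ErrClosed
	}
	return p.inner.Run(event)
}

// Close closes the inner processor at most once. Repeated calls are
// ignored here (an assumption of this sketch).
func (p *SafeProcessor) Close() error {
	if atomic.CompareAndSwapUint32(&p.closed, 0, 1) {
		return p.inner.Close()
	}
	return nil
}

// countingProcessor is a hypothetical inner processor that counts Close calls.
type countingProcessor struct{ closes int }

func (c *countingProcessor) Run(event *Event) (*Event, error) { return event, nil }

func (c *countingProcessor) Close() error {
	c.closes++
	return nil
}
```

Wrapping a `countingProcessor` and calling `Close` twice leaves its counter at 1, and any later `Run` returns `ErrClosed` with a nil event.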

cmacknz · 2023-03-07T21:17:07Z

Actually, I'm not sure using a log here will help. Processor failures already result in just a log line:

beats/libbeat/publisher/pipeline/client.go

Lines 97 to 103 in 331f792

    
event, err = c.processors.Run(event)
publish = event != nil
if err != nil {
	// If we introduce a dead-letter queue, this is where we should
	// route the event to it.
	log.Errorf("Failed to publish event: %v", err)
}

The problem is that this log is emitted for every event, so it might have been overwhelming both the beat and the agent log collection. Potentially the monitoring filebeat is receiving this log line for every event ingested by agent.

cmacknz · 2023-03-07T21:33:48Z

/test x-pack/libbeat-goIntegTest

cmacknz · 2023-03-07T22:16:16Z

My preference is to merge this and auto-close the issue. Then we can reopen the issue if we find another instance of this problem. We have no further action to take after merging this PR besides testing and observation.

I think reporting this error will be hard, because using a log line will potentially log once per event unless we add a rate limit for that log line or switch to using something more performant but harder to notice like a metrics counter.
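
The two options mentioned here, a rate limit on the log line and a counter, can be combined. The sketch below is only an illustration of that idea and not anything in Beats: `throttledReporter`, `manualClock`, `defaultInterval` and the 10 second interval are all hypothetical, and the log sink is a plain callback rather than a real logger.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrClosed stands in for the error returned by a closed processor.
var ErrClosed = errors.New("attempt to use a closed processor")

// defaultInterval is an assumed minimum gap between two log lines.
const defaultInterval = 10 * time.Second

// throttledReporter (hypothetical) counts every failure but logs at most
// one line per interval.
type throttledReporter struct {
	mu         sync.Mutex
	interval   time.Duration
	now        func() time.Time
	log        func(msg string)
	lastLog    time.Time
	hasLogged  bool
	suppressed uint64
	total      uint64
}

func newThrottledReporter(interval time.Duration, now func() time.Time, log func(msg string)) *throttledReporter {
	return &throttledReporter{interval: interval, now: now, log: log}
}

// Report records one failure and logs it unless a line was logged recently.
func (r *throttledReporter) Report(err error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.total++
	now := r.now()
	if r.hasLogged && now.Sub(r.lastLog) < r.interval {
		r.suppressed++
		return
	}
	msg := fmt.Sprintf("Failed to publish event: %v", err)
	if r.suppressed > 0 {
		msg += fmt.Sprintf(" (%d similar errors suppressed)", r.suppressed)
	}
	r.log(msg)
	r.hasLogged, r.lastLog, r.suppressed = true, now, 0
}

// Total returns the number of failures seen, logged or not.
func (r *throttledReporter) Total() uint64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.total
}

// manualClock is a hypothetical clock that only moves when advanced,
// which keeps the example deterministic.
type manualClock struct{ t time.Time }

func (c *manualClock) Now() time.Time          { return c.t }
func (c *manualClock) Advance(d time.Duration) { c.t = c.t.Add(d) }
```

Five failures reported at the same instant produce one log line; a sixth failure after the interval has passed produces a second line that also states how many were suppressed, while `Total` keeps the full count for a metric.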

rdner · 2023-03-08T10:05:42Z

@faec this is marked as closing #34716 because it's fixing the error reported there. If you manage to reproduce this error again please feel free to re-open it.

Could you elaborate more about that deadlock behaviour you're talking about? In my testing I didn't observe any deadlock in Filebeat.

Most importantly, let's not underestimate the danger of using closed processors. Some processors use background tasks to fetch data, as add_kubernetes_metadata does, for example. When you close such a processor, it stops updating its data and would attach outdated data to all events when Run is called after Close. This would silently corrupt our customers' data and we would never know. I'm pretty sure this was already happening because of the bug addressed in this PR.

We cannot use closed processors. We should rather produce a fatal error than let customers' data be corrupted.

I doubt there are more occurrences of processor re-use. I checked a lot of places, but if there are more, we can easily identify them now thanks to the SafeProcessor wrapper.

icc-garciaju · 2023-04-14T05:41:30Z

Is there any expected release date? When using elastic for alerting, it generates a lot of false positives.

rdner · 2023-04-17T12:41:18Z

@icc-garciaju not sure what alerting has to do with this issue but this fix has been released with 8.7.0 and was backported to the next 7.17.

icc-garciaju · 2023-04-17T12:47:56Z

Is it included on 8.7.0 of the agent? Is it being backported to 8.6? I was told by elastic support it would be on the next release.
As alerting is based on querying documents for logs or metrics, if the agent stops sending logs, the alerts go off.

rdner · 2023-04-17T12:58:26Z

@icc-garciaju

> Is it included on 8.7.0 of the agent?

yes.

> Is it being backported to 8.6

It was but I'm not sure if there was another 8.6.x release since then. You can always see the backports in the labels of our PRs.

> if the agent stops sending logs, the alerts go off.

If the issue you had was about having these errors in logs:

Harvester crashed with: harvester panic with: close of closed channel

Then it should be fixed in 8.7.0.

rdner added bug Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team backport-7.17 Automated backport to the 7.17 branch with mergify backport-v8.7.0 Automated backport with mergify labels Mar 7, 2023

rdner self-assigned this Mar 7, 2023

botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Mar 7, 2023

rdner mentioned this pull request Mar 7, 2023

Filebeat monitoring enters infinite error loop for "closed processor" #34716

Closed

rdner added 2 commits March 7, 2023 20:08

Fix linter issues

c90f3de

Add changelog entry

ce675d5

rdner marked this pull request as ready for review March 7, 2023 19:10

rdner requested a review from a team as a code owner March 7, 2023 19:10

rdner requested review from cmacknz and faec and removed request for a team March 7, 2023 19:10

cmacknz requested a review from leehinman March 7, 2023 20:05

cmacknz reviewed Mar 7, 2023

cmacknz approved these changes Mar 7, 2023

rdner mentioned this pull request Mar 7, 2023

add_kubernetes_metadata panics when a Beat shuts down #34219

Closed

faec approved these changes Mar 7, 2023

cmacknz added the backport-v8.6.0 Automated backport with mergify label Mar 7, 2023

cmacknz merged commit 5cfe62c into elastic:main Mar 7, 2023

This was referenced Mar 7, 2023

[7.17](backport #34761) Stop re-using processors defined in the config #34764

Merged

[8.6](backport #34761) Stop re-using processors defined in the config #34765

Merged

[8.7](backport #34761) Stop re-using processors defined in the config #34766

Merged

This was referenced Mar 8, 2023

Cover possible processor re-use with tests #34783

Closed

K8s Integration does not report correct container.id when container restarts elastic/integrations#5348

Closed

ShourieG mentioned this pull request Mar 16, 2023

Cloudflare Logpush Integration not working reliably with S3/SQS elastic/integrations#5526

Closed

rdner added a commit to rdner/beats that referenced this pull request Mar 21, 2023

Add test for the processor re-use issue

5e051d4

It's a follow-up to elastic#34761 This test makes sure that none of the critical configuration fields are re-used between instances of the pipeline client.

rdner mentioned this pull request Mar 21, 2023

Add test for the processor re-use issue #34870

Merged

rdner added a commit that referenced this pull request Mar 21, 2023

Add test for the processor re-use issue (#34870)

3d917c8

It's a follow-up to #34761 This test makes sure that none of the critical configuration fields are re-used between instances of the pipeline client.

rdner mentioned this pull request Mar 28, 2023

Processor errors can cause the Beat pipeline to enter what appears to be an infinite loop #34792

Closed

cmacknz mentioned this pull request Apr 3, 2023

[Elastic Agent] Default processors created per input can result in high agent CPU usage #35000

Closed

chrisberkhout pushed a commit that referenced this pull request Jun 1, 2023

Add test for the processor re-use issue (#34870)

bd933ac

It's a follow-up to #34761 This test makes sure that none of the critical configuration fields are re-used between instances of the pipeline client.

rdner deleted the stop-re-using-processors branch October 24, 2024 08:32

rdner mentioned this pull request Feb 19, 2025

build(deps): bump github.com/Azure/azure-sdk-for-go/sdk/azidentity from 1.7.0 to 1.8.2 #42690

Merged
