[Elastic Agent] Default processors created per input can result in high agent CPU usage #35000

@cmacknz

Background

Starting with 8.6, the default global processors for Beats that are run by agent are configured in code instead of being read from the default Beat configuration file. Beats managed by agent no longer read a configuration file at startup; instead they wait for agent to send their initial configuration. This change was made in #34149.

The implementation from #34149 makes the processors global by configuring them on each input run by Elastic Agent. For example, the beat-rendered-config.yml file available in the agent diagnostics shows the processors at the input level:

- data_stream:
    dataset: system.auth
    type: logs
  exclude_files:
  - .gz$
  id: logfile-system.auth-8913b026-f53e-43f4-909b-d7c91335f141
  index: logs-system.auth-vault
  multiline:
    match: after
    pattern: ^\s
  paths:
  - /var/log/auth.log*
  - /var/log/secure*
  processors:
  - add_host_metadata:
      when:
        not:
          contains:
            tags: forwarded
  - add_cloud_metadata: null
  - add_docker_metadata: null
  - add_kubernetes_metadata: null

This is in contrast to the default configuration file, which defines a single instance of each processor for the whole process by declaring them at the top level of the configuration (see "Where are processors valid" for details):

processors:
  - add_host_metadata:
      when.not.contains.tags: forwarded
  - add_cloud_metadata: ~
  - add_docker_metadata: ~
  - add_kubernetes_metadata: ~

Problem

Similarly, in 8.6 the aws-s3 input was changed in #33658 to create a new beat.Client for each new SQS worker to improve performance. This results in a new input pipeline being constructed for each SQS worker, each of which gets its own instance of the per-input processors as of 8.6.

This was not a problem until 8.7, when it was discovered that each instance of a Beat input pipeline was referencing an accidentally global instance of the per-input processors. This was fixed in #34761. As a result of #34761, each input pipeline now constructs a new instance of the global processors.

Each of these global processors is expensive to create and includes code that tries to perform the expensive work only once, at initialization time. This only works if there is a single instance of the processor. Otherwise each instance reinitializes itself, often performing the exact same request multiple times. For example:

  1. In the case of Docker or Kubernetes, each instance subscribes to and watches the Docker/k8s API for changes, and each instance keeps its own in-memory state of the Docker/k8s resources. Ideally this should be done only once per process, for the sake of both the Beat and the Docker/k8s API.
  2. The repeated construction of these processors by inputs like aws-s3 and filestream will be slow, because these processors generally have a high instantiation cost. For Docker, for example, the constructor creates a new Docker client and tries making an API call before returning. In add_cloud_metadata, an inadvertent change introduced a 3s worst-case construction cost ([libbeat] add_cloud_metadata - startup blocked by AWS IMSDv2 token fetch #33058).

Impact

As of 8.7 we are observing extremely high CPU usage for Beats run under agent, and for the agent itself, in situations where inputs are frequently created. For example, in the case of the add_cloud_metadata processor, we are observing the agent being spammed by repeated log messages from the add_cloud_metadata initialization sequence:

{"log.level":"info","@timestamp":"2023-04-01T00:14:59.605Z","message":"add_cloud_metadata: hosting provider type detected as aws, metadata={\"cloud\":{\"account\":{\"id\":\"REDACTED\"},\"availability_zone\":\"us-west-2d\",\"image\":{\"id\":\"ami-REDACTED\"},\"instance\":{\"id\":\"i-REDACTED\"},\"machine\":{\"type\":\"r6g.xlarge\"},\"provider\":\"aws\",\"region\":\"us-west-2\",\"service\":{\"name\":\"EC2\"}}}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-s3-default","type":"aws-s3"},"log":{"source":"aws-s3-default"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"add_cloud_metadata","log.origin":{"file.line":106,"file.name":"add_cloud_metadata/add_cloud_metadata.go"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-01T00:14:59.715Z","message":"add_cloud_metadata: hosting provider type detected as aws, metadata={\"cloud\":{\"account\":{\"id\":\"REDACTED\"},\"availability_zone\":\"us-west-2d\",\"image\":{\"id\":\"ami-REDACTED\"},\"instance\":{\"id\":\"i-REDACTED\"},\"machine\":{\"type\":\"r6g.xlarge\"},\"provider\":\"aws\",\"region\":\"us-west-2\",\"service\":{\"name\":\"EC2\"}}}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-s3-default","type":"aws-s3"},"log":{"source":"aws-s3-default"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"add_cloud_metadata","log.origin":{"file.line":106,"file.name":"add_cloud_metadata/add_cloud_metadata.go"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-01T00:14:59.716Z","message":"add_cloud_metadata: hosting provider type detected as aws, metadata={\"cloud\":{\"account\":{\"id\":\"REDACTED\"},\"availability_zone\":\"us-west-2d\",\"image\":{\"id\":\"ami-REDACTED\"},\"instance\":{\"id\":\"i-REDACTED\"},\"machine\":{\"type\":\"r6g.xlarge\"},\"provider\":\"aws\",\"region\":\"us-west-2\",\"service\":{\"name\":\"EC2\"}}}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-s3-default","type":"aws-s3"},"log":{"source":"aws-s3-default"},"log.logger":"add_cloud_metadata","log.origin":{"file.line":106,"file.name":"add_cloud_metadata/add_cloud_metadata.go"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-01T00:14:59.824Z","message":"add_cloud_metadata: hosting provider type detected as aws, metadata={\"cloud\":{\"account\":{\"id\":\"REDACTED\"},\"availability_zone\":\"us-west-2d\",\"image\":{\"id\":\"ami-REDACTED\"},\"instance\":{\"id\":\"i-REDACTED\"},\"machine\":{\"type\":\"r6g.xlarge\"},\"provider\":\"aws\",\"region\":\"us-west-2\",\"service\":{\"name\":\"EC2\"}}}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-s3-default","type":"aws-s3"},"log":{"source":"aws-s3-default"},"log.origin":{"file.line":106,"file.name":"add_cloud_metadata/add_cloud_metadata.go"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"add_cloud_metadata","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-01T00:15:00.062Z","message":"add_cloud_metadata: hosting provider type detected as aws, metadata={\"cloud\":{\"account\":{\"id\":\"REDACTED\"},\"availability_zone\":\"us-west-2d\",\"image\":{\"id\":\"ami-REDACTED\"},\"instance\":{\"id\":\"i-REDACTED\"},\"machine\":{\"type\":\"r6g.xlarge\"},\"provider\":\"aws\",\"region\":\"us-west-2\",\"service\":{\"name\":\"EC2\"}}}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-s3-default","type":"aws-s3"},"log":{"source":"aws-s3-default"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"add_cloud_metadata","log.origin":{"file.line":106,"file.name":"add_cloud_metadata/add_cloud_metadata.go"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-01T00:15:00.194Z","message":"add_cloud_metadata: hosting provider type detected as aws, metadata={\"cloud\":{\"account\":{\"id\":\"REDACTED\"},\"availability_zone\":\"us-west-2d\",\"image\":{\"id\":\"ami-REDACTED\"},\"instance\":{\"id\":\"i-REDACTED\"},\"machine\":{\"type\":\"r6g.xlarge\"},\"provider\":\"aws\",\"region\":\"us-west-2\",\"service\":{\"name\":\"EC2\"}}}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-s3-default","type":"aws-s3"},"log":{"source":"aws-s3-default"},"log.origin":{"file.line":106,"file.name":"add_cloud_metadata/add_cloud_metadata.go"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"add_cloud_metadata","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-01T00:15:00.282Z","message":"add_cloud_metadata: hosting provider type detected as aws, metadata={\"cloud\":{\"account\":{\"id\":\"REDACTED\"},\"availability_zone\":\"us-west-2d\",\"image\":{\"id\":\"ami-REDACTED\"},\"instance\":{\"id\":\"i-REDACTED\"},\"machine\":{\"type\":\"r6g.xlarge\"},\"provider\":\"aws\",\"region\":\"us-west-2\",\"service\":{\"name\":\"EC2\"}}}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-s3-default","type":"aws-s3"},"log":{"source":"aws-s3-default"},"log.logger":"add_cloud_metadata","log.origin":{"file.line":106,"file.name":"add_cloud_metadata/add_cloud_metadata.go"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-01T00:15:00.394Z","message":"add_cloud_metadata: hosting provider type detected as aws, metadata={\"cloud\":{\"account\":{\"id\":\"REDACTED\"},\"availability_zone\":\"us-west-2d\",\"image\":{\"id\":\"ami-REDACTED\"},\"instance\":{\"id\":\"i-REDACTED\"},\"machine\":{\"type\":\"r6g.xlarge\"},\"provider\":\"aws\",\"region\":\"us-west-2\",\"service\":{\"name\":\"EC2\"}}}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-s3-default","type":"aws-s3"},"log":{"source":"aws-s3-default"},"log.logger":"add_cloud_metadata","log.origin":{"file.line":106,"file.name":"add_cloud_metadata/add_cloud_metadata.go"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-01T00:15:00.477Z","message":"add_cloud_metadata: hosting provider type detected as aws, metadata={\"cloud\":{\"account\":{\"id\":\"REDACTED\"},\"availability_zone\":\"us-west-2d\",\"image\":{\"id\":\"ami-REDACTED\"},\"instance\":{\"id\":\"i-REDACTED\"},\"machine\":{\"type\":\"r6g.xlarge\"},\"provider\":\"aws\",\"region\":\"us-west-2\",\"service\":{\"name\":\"EC2\"}}}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-s3-default","type":"aws-s3"},"log":{"source":"aws-s3-default"},"log.logger":"add_cloud_metadata","log.origin":{"file.line":106,"file.name":"add_cloud_metadata/add_cloud_metadata.go"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-01T00:15:00.542Z","message":"add_cloud_metadata: hosting provider type detected as aws, metadata={\"cloud\":{\"account\":{\"id\":\"REDACTED\"},\"availability_zone\":\"us-west-2d\",\"image\":{\"id\":\"ami-REDACTED\"},\"instance\":{\"id\":\"i-REDACTED\"},\"machine\":{\"type\":\"r6g.xlarge\"},\"provider\":\"aws\",\"region\":\"us-west-2\",\"service\":{\"name\":\"EC2\"}}}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-s3-default","type":"aws-s3"},"log":{"source":"aws-s3-default"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"add_cloud_metadata","log.origin":{"file.line":106,"file.name":"add_cloud_metadata/add_cloud_metadata.go"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-01T00:15:00.629Z","message":"add_cloud_metadata: hosting provider type detected as aws, metadata={\"cloud\":{\"account\":{\"id\":\"REDACTED\"},\"availability_zone\":\"us-west-2d\",\"image\":{\"id\":\"ami-REDACTED\"},\"instance\":{\"id\":\"i-REDACTED\"},\"machine\":{\"type\":\"r6g.xlarge\"},\"provider\":\"aws\",\"region\":\"us-west-2\",\"service\":{\"name\":\"EC2\"}}}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-s3-default","type":"aws-s3"},"log":{"source":"aws-s3-default"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"add_cloud_metadata","log.origin":{"file.line":106,"file.name":"add_cloud_metadata/add_cloud_metadata.go"},"ecs.version":"1.6.0"}

This comes from the code below, which includes a sync.Once block that is defeated because a new instance of the processor is created for each individual input:

func (p *addCloudMetadata) init() {
	p.initOnce.Do(func() {
		result := p.fetchMetadata()
		if result == nil {
			p.logger.Info("add_cloud_metadata: hosting provider type not detected.")
			return
		}
		p.metadata = result.metadata
		p.logger.Infof("add_cloud_metadata: hosting provider type detected as %v, metadata=%v",
			result.provider, result.metadata.String())
	})
}

func (p *addCloudMetadata) getMeta() mapstr.M {
	p.init()
	// ... (the rest of getMeta is not shown in this snippet)
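
The effect can be reproduced in isolation. The following is a minimal sketch, not the Beats code: metadataProcessor, fetchCount, and both helper functions are hypothetical stand-ins, and the expensive fetch is simulated by incrementing a counter. It shows that a sync.Once stored on the instance guards only that instance, so creating one processor per input repeats the fetch once per input.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchCount records how many times the simulated expensive fetch runs.
// The example is single-goroutine, so a plain int is sufficient here.
var fetchCount int

// metadataProcessor is a hypothetical stand-in for a processor such as
// add_cloud_metadata that lazily initializes itself behind a sync.Once.
type metadataProcessor struct {
	initOnce sync.Once
	metadata string
}

func (p *metadataProcessor) init() {
	p.initOnce.Do(func() {
		fetchCount++ // simulated expensive metadata request
		p.metadata = "provider=aws"
	})
}

func (p *metadataProcessor) getMeta() string {
	p.init()
	return p.metadata
}

// fetchesWithPerInputInstances builds a new processor for every input and
// returns how many fetches were performed in total.
func fetchesWithPerInputInstances(inputs, eventsPerInput int) int {
	fetchCount = 0
	for i := 0; i < inputs; i++ {
		p := &metadataProcessor{} // new instance, so a fresh sync.Once
		for j := 0; j < eventsPerInput; j++ {
			p.getMeta()
		}
	}
	return fetchCount
}

// fetchesWithSharedInstance reuses one processor across all inputs.
func fetchesWithSharedInstance(inputs, eventsPerInput int) int {
	fetchCount = 0
	shared := &metadataProcessor{}
	for i := 0; i < inputs; i++ {
		for j := 0; j < eventsPerInput; j++ {
			shared.getMeta()
		}
	}
	return fetchCount
}

func main() {
	fmt.Println("per-input instances:", fetchesWithPerInputInstances(5, 100)) // prints 5
	fmt.Println("shared instance:", fetchesWithSharedInstance(5, 100))        // prints 1
}
```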

Solution

When Beats run under agent, we need to create the default global processors at the Beat process level instead of at the input level, to match what is done in the global configuration file.
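
One possible shape for this, as a minimal sketch only: build the default processors once per process and hand the same instances to every input pipeline. All names here (processor, hostMetadata, globalProcessors, pipeline, newPipeline) are hypothetical and do not correspond to the actual libbeat API. The sketch also assumes the shared processors are safe to use from multiple pipelines.

```go
package main

import (
	"fmt"
	"sync"
)

// processor is a hypothetical minimal processor interface.
type processor interface {
	Run(event map[string]string)
}

// hostMetadata is a hypothetical processor with an expensive constructor.
type hostMetadata struct{ host string }

func (h *hostMetadata) Run(event map[string]string) { event["host.name"] = h.host }

// constructions counts how often the expensive constructor runs.
var constructions int

func newHostMetadata() *hostMetadata {
	constructions++ // stands in for API calls, watchers, and so on
	return &hostMetadata{host: "example-host"}
}

var (
	globalOnce  sync.Once
	globalProcs []processor
)

// globalProcessors builds the default processors once per process and
// returns the same instances on every subsequent call.
func globalProcessors() []processor {
	globalOnce.Do(func() {
		globalProcs = []processor{newHostMetadata()}
	})
	return globalProcs
}

// pipeline is a hypothetical per-input pipeline that reuses the
// process-level processors instead of constructing its own.
type pipeline struct {
	id         string
	processors []processor
}

func newPipeline(id string) *pipeline {
	return &pipeline{id: id, processors: globalProcessors()}
}

func (p *pipeline) publish(event map[string]string) map[string]string {
	for _, proc := range p.processors {
		proc.Run(event)
	}
	return event
}

func main() {
	for i := 0; i < 3; i++ {
		p := newPipeline(fmt.Sprintf("aws-s3-worker-%d", i))
		p.publish(map[string]string{"message": "hello"})
	}
	fmt.Println("processor constructions:", constructions) // prints 1
}
```

Here the sync.Once lives at package level rather than on the processor instance, so creating more pipelines no longer triggers more constructions.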
