add support for queue settings under outputs #36693

leehinman wants to merge 7 commits into elastic:main
Conversation
This pull request does not have a backport label.
To fixup this pull request, you need to add the backport labels for the needed branches.
Force-pushed from 55975d4 to 1b4bd24
andrewkroh left a comment
Could we make it an error to set both the top-level `queue` and `output.*.queue`?
added validation, see what you think.
libbeat/docs/queueconfig.asciidoc (outdated)

> You can configure the type and behavior of the internal queue by
> setting options in the `queue` section of the +{beatname_lc}.yml+
> config file or by setting options in the `queue` section of the
> output. Only one queue type can be configured. If both the top level
> If both the top level queue section and the output section are specified the output section
> takes precedence.
This is no longer true now that it will fail validation.
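For illustration, a sketch of the two now mutually exclusive forms (hosts and event counts are made up for this example, not values from the PR):

```yaml
# Form 1: top-level queue section in the Beat config.
queue:
  mem:
    events: 4096

# Form 2 (added by this PR): queue nested under the output.
# Specifying both forms at once now fails validation.
output.elasticsearch:
  hosts: ["localhost:9200"]
  queue:
    mem:
      events: 4096
```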
```yaml
queue:
  mem:
    events: 8096
```
I assume the disk queue can also be enabled this way? Do we want to enable configuring the disk queue for agent? Or should we have an explicit check that if a Beat is managed by agent the disk queue is disabled.
I lean towards the latter because we haven't planned any testing effort around the disk queue in agent yet, and the goal of this implementation was mostly to allow performance tuning of the memory queue settings in the field.
I could be convinced otherwise.
> I assume the disk queue can also be enabled this way?

Yep

> Do we want to enable configuring the disk queue for agent? Or should we have an explicit check that if a Beat is managed by agent the disk queue is disabled.
>
> I lean towards the latter because we haven't planned any testing effort around the disk queue in agent yet, and the goal of this implementation was mostly to allow performance tuning of the memory queue settings in the field.
>
> I could be convinced otherwise.

I'll code up a check and we can evaluate. My personal preference is not for Beats to have different features when run under agent. It makes it harder to debug if the same config does different things when run under agent or not. But given that users can input any yaml, this might be acceptable.
The code for the check is small, but there will be work in ensuring the error message shows up in an obvious way in the elastic agent status output and in fleet.
I think I am fine with allowing the disk queue, but considering it unsupported and just not documenting it until we do some performance testing. This also allows us to retroactively declare we support it once the testing is done or the work is done to expose the configurations in fleet.
Take a look at `checkAgentDiskQueue` in `beat.go`.
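The actual check lives in `libbeat/cmd/instance/beat.go` and inspects the unpacked Beat configuration; as a rough sketch of the idea (the struct and signature below are illustrative stand-ins, not the real `beatConfig` internals):

```go
package main

import (
	"errors"
	"fmt"
)

// outputQueueConfig is a hypothetical stand-in for the queue section
// found under an output in the Beat configuration.
type outputQueueConfig struct {
	Type string // "mem" or "disk"
}

// checkAgentDiskQueue rejects the disk queue when the Beat runs under
// Elastic Agent, since Agent cannot yet provide a unique queue path per
// running Beat (see the discussion below).
func checkAgentDiskQueue(underAgent bool, queue *outputQueueConfig) error {
	if underAgent && queue != nil && queue.Type == "disk" {
		return errors.New("disk queue is not supported when running under Elastic Agent")
	}
	return nil
}

func main() {
	// Standalone Beat with a disk queue: allowed.
	fmt.Println(checkAgentDiskQueue(false, &outputQueueConfig{Type: "disk"}))
	// Under agent with a disk queue: rejected with an error.
	fmt.Println(checkAgentDiskQueue(true, &outputQueueConfig{Type: "disk"}))
}
```

The key detail from the comment above is ordering: the check can only run after `management.NewManager()` has set the under-agent flag.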
We could enable the disk queue as a tech preview or beta and perform the testing in a later sprint. For the most part the functionality has been in beats for a while.
Actually I've changed my mind on this: the disk queue needs changes in Elastic Agent to work properly, so we shouldn't allow using it.
Agent needs to create a unique disk queue path per running Beat, or all of them will try to share the same queue, which would lead to data loss or duplication. We also need to put the disk queue outside of the agent data directory by default so it doesn't have to be copied between upgrades. This isn't as simple as allowing us to turn it on in the configuration.
I will create a follow-up issue specifically for allowing this. For now let's forbid using the disk queue under agent to avoid allowing configurations that don't work properly.
See elastic/elastic-agent#3490 for the complications involved in enabling the disk queue for the Elastic Agent.
Thanks for looking into this, Craig.
Force-pushed from d61263d to 61c2d28
libbeat/cmd/instance/beat.go (outdated)

```go
// checkAgentDiskQueue should be run after management.NewManager() so
// that publisher.UnderAgent will be set with correct value
func checkAgentDiskQueue(bc *beatConfig) error {
```
I tested this and it doesn't seem to work:

```
sudo elastic-agent inspect
...
outputs:
  default:
    api_key: sfga4ooBeuYcqqQuT4_b:h4h_6lRaTEqN48FSOwST-A
    hosts:
      - https://60a01e9179764ca0b9c2e2fbf47d6d67.eastus2.staging.azure.foundit.no:443
    queue:
      disk:
        max_size: 10GB
    type: elasticsearch
path:
  config: /Library/Elastic/Agent
  data: /Library/Elastic/Agent/data
  home: /Library/Elastic/Agent/data/elastic-agent-123ba9
  logs: /Library/Elastic/Agent
```

```
sudo elastic-agent status
┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   └─ status: (HEALTHY) Running
```

Steps to reproduce:

1. Build metricbeat and filebeat from this branch. I used `cd x-pack/metricbeat && mage build`.
2. Build or download an 8.11 snapshot agent.
3. Copy the built metricbeat and filebeat into the `data/elastic-agent-$hash/components` directory of the agent: `cp ~/go/src/github.com/elastic/beats/x-pack/metricbeat/metricbeat /Users/cmackenzie/Downloads/elastic-agent-8.11.0-SNAPSHOT-darwin-aarch64/data/elastic-agent-123ba9/components`
4. Install the agent. I enrolled it with Fleet but you could test this with a standalone agent too.
5. Configure the disk queue in the output as shown above.
6. Expect to see an error but see nothing.

I also saw no obvious logs about the queue type being configured in the logs or metrics. I suspect the memory queue configuration might be getting ignored as well.
I did confirm the running version of the beat was the most recent commit from this branch.
For Fleet I had the following in the advanced yaml parameters box of the output, with the system integration installed with logs and monitoring enabled.

```yaml
queue.disk:
  max_size: 10GB
```
This is what I see in the beat-rendered-config.yml for the log-default component:

```yaml
outputs:
  elasticsearch:
    api_key: <REDACTED>
    bulk_max_size: 50
    hosts:
      - https://<REDACTED>.eastus2.staging.azure.foundit.no:443
    queue:
      disk:
        max_size: 10GB
    type: elasticsearch
```
Something on the agent output reload path might not be reloading the queue settings:
`x-pack/libbeat/management/managerV2.go`, lines 689 to 726 at 61c2d28
I changed it to be a Validation check, which should happen whenever the config is unpacked. So that should fix the disk queue problem.
The reload may be more interesting. I think this will mean every output change requires a full restart, since you need the queue settings earlier to set up the pipeline.
We already restart the Beats when the output configuration changes. It looks optional in the code but it is enabled in all the Beat spec files: https://github.com/elastic/elastic-agent/blob/123ba9ce80c9865f72fa3659b5cafe9b51954f49/specs/filebeat.spec.yml#L32-L33

```yaml
- "-E"
- "management.restart_on_output_change=true"
```

We also need restarts whenever any of the TLS configuration changes, for example.
- add support for `idle_connection_timeout` for ES output
- add support for queue settings under output

Closes elastic#35615
Force-pushed from 61c2d28 to 476780e
Quick note on why this is working with a standalone beat but not elastic-agent: under elastic-agent, Beats is started with a blank config, and the elastic-agent then sends the output over the control protocol, which is handled by the manager. Right now, the manager's output reloader is created from a publish.OutputReloader, and we don't have a publish object until we create a new pipeline, which requires the queue settings. So getting this to work with elastic-agent will require some more re-working of how the output reload works and its relationship to a pipeline.
Proposed commit message

- add support for `idle_connection_timeout` for ES output. This allows connections to be closed if they aren't being used.
- add support for queue settings under output. Validation ensures only top level or output level is specified.
Checklist

- `CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.

Author's Checklist
How to test this PR locally

Start Filebeat with the following config file:
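The config file itself isn't reproduced in this thread; a minimal sketch consistent with the expectations below (a 1024-event memory queue on the output and a 3s idle connection timeout; the input, paths, and hosts are placeholders, not values from the PR) might look like:

```yaml
filebeat.inputs:
  - type: filestream
    id: test-input
    paths:
      - /tmp/test.log

output.elasticsearch:
  hosts: ["localhost:9200"]
  # New setting from this PR: close ES connections that sit idle.
  idle_connection_timeout: 3s
  # New with this PR: queue settings nested under the output.
  queue:
    mem:
      events: 1024
```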
Should show `queue.max_events` in the metrics to be 1024, and connections to elasticsearch should only stay open for 3 seconds. Connection status can be checked with `netstat -an`.

Related issues