add support for queue settings under outputs #36693

leehinman wants to merge 7 commits into elastic:main
Conversation
This pull request does not have a backport label.
To fixup this pull request, you need to add the backport labels for the needed branches.
Force-pushed from 55975d4 to 1b4bd24
andrewkroh left a comment
Could we make it an error to set both the top-level `queue` and `output.*.queue`?
added validation, see what you think.
libbeat/docs/queueconfig.asciidoc (outdated)

> You can configure the type and behavior of the internal queue by
> setting options in the `queue` section of the +{beatname_lc}.yml+
> config file or by setting options in the `queue` section of the
> output. Only one queue type can be configured. If both the top level
> If both the top level queue section and the output section are specified the output section
> takes precedence.
This is no longer true now that it will fail validation.
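For illustration, a sketch of the two now mutually exclusive forms (hosts and event counts are made up for this example, not values from the PR):

```yaml
# Form 1: top-level queue section in the Beat config.
queue:
  mem:
    events: 4096

# Form 2 (added by this PR): queue nested under the output.
# Specifying both forms at once now fails validation.
output.elasticsearch:
  hosts: ["localhost:9200"]
  queue:
    mem:
      events: 4096
```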
```yaml
queue:
  mem:
    events: 8096
```
I assume the disk queue can also be enabled this way? Do we want to enable configuring the disk queue for agent? Or should we have an explicit check that if a Beat is managed by agent the disk queue is disabled.
I lean towards the latter because we haven't planned any testing effort around the disk queue in agent yet, and the goal of this implementation was mostly to allow performance tuning of the memory queue settings in the field.
I could be convinced otherwise.
> I assume the disk queue can also be enabled this way?

Yep

> Do we want to enable configuring the disk queue for agent? Or should we have an explicit check that if a Beat is managed by agent the disk queue is disabled.
>
> I lean towards the latter because we haven't planned any testing effort around the disk queue in agent yet, and the goal of this implementation was mostly to allow performance tuning of the memory queue settings in the field.
>
> I could be convinced otherwise.

I'll code up a check and we can evaluate. My personal preference is not for Beats to have different features when run under agent. It makes it harder to debug if the same config does different things when run under agent or not. But given that users can input any yaml, this might be acceptable.
The code for the check is small, but there will be work in ensuring the error message shows up in an obvious way in the elastic agent status output and in fleet.
I think I am fine with allowing the disk queue, but considering it unsupported and just not documenting it until we do some performance testing. This also allows us to retroactively declare we support it once the testing is done or the work is done to expose the configurations in fleet.
Take a look at `checkAgentDiskQueue` in `beat.go`.
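The actual check lives in `libbeat/cmd/instance/beat.go` and inspects the unpacked Beat configuration; as a rough sketch of the idea (the struct and signature below are illustrative stand-ins, not the real `beatConfig` internals):

```go
package main

import (
	"errors"
	"fmt"
)

// outputQueueConfig is a hypothetical stand-in for the queue section
// found under an output in the Beat configuration.
type outputQueueConfig struct {
	Type string // "mem" or "disk"
}

// checkAgentDiskQueue rejects the disk queue when the Beat runs under
// Elastic Agent, since Agent cannot yet provide a unique queue path per
// running Beat (see the discussion below).
func checkAgentDiskQueue(underAgent bool, queue *outputQueueConfig) error {
	if underAgent && queue != nil && queue.Type == "disk" {
		return errors.New("disk queue is not supported when running under Elastic Agent")
	}
	return nil
}

func main() {
	// Standalone Beat with a disk queue: allowed.
	fmt.Println(checkAgentDiskQueue(false, &outputQueueConfig{Type: "disk"}))
	// Under agent with a disk queue: rejected with an error.
	fmt.Println(checkAgentDiskQueue(true, &outputQueueConfig{Type: "disk"}))
}
```

The key detail from the comment above is ordering: the check can only run after `management.NewManager()` has set the under-agent flag.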
We could enable the disk queue as a tech preview or beta and perform the testing in a later sprint. For the most part the functionality has been in beats for a while.
Actually I've changed my mind on this: the disk queue needs changes in Elastic Agent to work properly, so we shouldn't allow using it.
Agent needs to create a unique disk queue path per running Beat, or all of them will try to share the same queue, which would lead to data loss or duplication. We also need to put the disk queue outside of the agent data directory by default so it doesn't have to be copied between upgrades. This isn't as simple as allowing us to turn it on in the configuration.
I will create a follow-up issue specifically for allowing this. For now let's forbid using the disk queue under agent to avoid allowing configurations that don't work properly.
See elastic/elastic-agent#3490 for the complications involved in enabling the disk queue for the Elastic Agent.
Thanks for looking into this, Craig.
Force-pushed from d61263d to 61c2d28
libbeat/cmd/instance/beat.go (outdated)

```go
// checkAgentDiskQueue should be run after management.NewManager() so
// that publisher.UnderAgent will be set with correct value
func checkAgentDiskQueue(bc *beatConfig) error {
```
I tested this and it doesn't seem to work:

```
sudo elastic-agent inspect
...
outputs:
  default:
    api_key: sfga4ooBeuYcqqQuT4_b:h4h_6lRaTEqN48FSOwST-A
    hosts:
      - https://60a01e9179764ca0b9c2e2fbf47d6d67.eastus2.staging.azure.foundit.no:443
    queue:
      disk:
        max_size: 10GB
    type: elasticsearch
path:
  config: /Library/Elastic/Agent
  data: /Library/Elastic/Agent/data
  home: /Library/Elastic/Agent/data/elastic-agent-123ba9
  logs: /Library/Elastic/Agent
```

```
sudo elastic-agent status
┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   └─ status: (HEALTHY) Running
```

Steps to reproduce:

1. Build metricbeat and filebeat from this branch. I used `cd x-pack/metricbeat && mage build`.
2. Build or download an 8.11 snapshot agent.
3. Copy the built metricbeat and filebeat into the `data/elastic-agent-$hash/components` directory of the agent: `cp ~/go/src/github.com/elastic/beats/x-pack/metricbeat/metricbeat /Users/cmackenzie/Downloads/elastic-agent-8.11.0-SNAPSHOT-darwin-aarch64/data/elastic-agent-123ba9/components`
4. Install the agent. I enrolled it with Fleet but you could test this with a standalone agent too.
5. Configure the disk queue in the output as shown above.
6. Expect to see an error but see nothing.

I also saw no obvious logs about the queue type being configured in the logs or metrics. I suspect the memory queue configuration might be getting ignored as well.
I did confirm the running version of the beat was the most recent commit from this branch.
For Fleet I had the following in the advanced yaml parameters box of the output, with the system integration installed with logs and monitoring enabled.

```yaml
queue.disk:
  max_size: 10GB
```
This is what I see in the beat-rendered-config.yml for the log-default component:

```yaml
outputs:
  elasticsearch:
    api_key: <REDACTED>
    bulk_max_size: 50
    hosts:
      - https://<REDACTED>.eastus2.staging.azure.foundit.no:443
    queue:
      disk:
        max_size: 10GB
    type: elasticsearch
```
Something on the agent output reload path might not be reloading the queue settings:
`x-pack/libbeat/management/managerV2.go`, lines 689 to 726 at 61c2d28
I changed it to be a Validation check, which should happen whenever the config is unpacked. So that should fix the disk queue problem.
The reload may be more interesting. I think this will mean every output change requires a full restart, since you need the queue settings earlier to set up the pipeline.
We already restart the Beats when the output configuration changes. It looks optional in the code but it is enabled in all the Beat spec files: https://github.com/elastic/elastic-agent/blob/123ba9ce80c9865f72fa3659b5cafe9b51954f49/specs/filebeat.spec.yml#L32-L33

```yaml
- "-E"
- "management.restart_on_output_change=true"
```

We also need restarts whenever any of the TLS configuration changes, for example.
- add support for `idle_connection_timeout` for ES output
- add support for queue settings under output

Closes elastic#35615
Force-pushed from 61c2d28 to 476780e
Quick note on why this is working with a standalone beat but not elastic-agent: under elastic-agent, Beats is started with a blank config, and the elastic-agent then sends the output over the control protocol, which is handled by the manager. Right now, the manager's output reloader is created from a publish.OutputReloader, and we don't have a publish object until we create a new pipeline, which requires the queue settings. So getting this to work with elastic-agent will require some more re-working of how the output reload works and its relationship to a pipeline.
Proposed commit message

- add support for `idle_connection_timeout` for ES output. This allows connections to be closed if they aren't being used.
- add support for queue settings under output. Validation ensures only top level or output level is specified.
Checklist

- `CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.

Author's Checklist
How to test this PR locally

Start Filebeat with the following config file:
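The config file itself isn't reproduced in this thread; a minimal sketch consistent with the expectations below (a 1024-event memory queue on the output and a 3s idle connection timeout; the input, paths, and hosts are placeholders, not values from the PR) might look like:

```yaml
filebeat.inputs:
  - type: filestream
    id: test-input
    paths:
      - /tmp/test.log

output.elasticsearch:
  hosts: ["localhost:9200"]
  # New setting from this PR: close ES connections that sit idle.
  idle_connection_timeout: 3s
  # New with this PR: queue settings nested under the output.
  queue:
    mem:
      events: 1024
```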
Should show `queue.max_events` in the metrics to be 1024, and connections to elasticsearch should only stay open for 3 seconds. Connection status can be checked with `netstat -an`.

Related issues