This repository was archived by the owner on Sep 21, 2023. It is now read-only.

[Meta] Implement shipper performance testing #57

@cmacknz

Description

The Elastic agent data shipper is actively under development and we need a way to benchmark its performance as part of the agent system. Specifically we are interested in benchmarking the achievable throughput of a single agent using the shipper along with its CPU, memory, and disk IOPS overhead. Users care about the performance of the agent and we need a way to measure and improve it.

Design

The proposed solution is to develop a new load generating input for the agent, which can be installed and configured as a standard agent integration. The test scenario can be changed by modifying the integration configuration or agent policy. Metrics will be collected using the existing agent monitoring features. Where the existing agent monitoring is not adequate, it should be enhanced so that all data necessary to diagnose performance issues is also available in the field. For example, all performance data should be available in the existing agent metrics dashboard.

[Diagram: Data Shipper Performance Testing architecture]

The new load generating input should be developed as one of the first non-beat inputs in the V2 agent input architecture. The load generator should be packaged into an agent load testing integration developed using the existing Elastic package tooling. Any agent can then be load tested by installing the necessary integration.

Automated deployment and provisioning can ideally reuse the same tools used to provision Fleet managed agents for end-to-end testing, with minimal extra work. When testing Elasticsearch, the instance used for Fleet and monitoring data should ideally be separate from the instance receiving data from the shipper, to avoid introducing instability into Fleet itself during stress tests.

The performance metrics resulting from each test can be queried out of the agent monitoring indices at the conclusion of each test. Profiles can be periodically collected via agent diagnostics or the /debug/pprof endpoint of the shipper.

The initial version of the agent load testing package will implement only a shipper client, which it will use to write simulated or pre-recorded events at a configurable rate. Multiple tools exist that could be integrated into the load generator input to generate data on demand: stream, the integration corpus generator, spigot, or flog.
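The pacing loop at the heart of such an input might look like the following sketch. The `emit` callback stands in for handing an event to the shipper client; the real input would use the shipper's own API rather than this hypothetical signature:

```go
package main

import (
	"fmt"
	"time"
)

// generateEvents emits count synthetic events at roughly perSecond
// events per second, invoking emit for each one. This is only a sketch
// of the rate-control loop; the actual input would pass events to the
// shipper client and handle backpressure from it.
func generateEvents(count, perSecond int, emit func(i int)) {
	interval := time.Second / time.Duration(perSecond)
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for i := 0; i < count; i++ {
		<-ticker.C // wait for the next send slot
		emit(i)
	}
}

func main() {
	sent := 0
	start := time.Now()
	// 100 events at a nominal 1000 events/second.
	generateEvents(100, 1000, func(i int) { sent++ })
	fmt.Printf("sent %d events in %s\n", sent, time.Since(start).Round(time.Millisecond))
}
```

A ticker-based loop keeps the configured rate independent of how fast individual events are produced, which is what makes throughput tests repeatable.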

Future versions of the load testing package can be developed with the load generator input configured to act as the data source for other inputs to pull from. For example a filebeat instance could be started and configured to consume data from the load generator using the syslog protocol, enabling tests of the entire agent ingestion system. Stream is already used to test integrations with elastic-package today and could serve as the starting point for this functionality.

Implementation Plan

TBD. Insert a development plan with linked issues, including at least the following high level tasks:

  1. Develop a load generator agent input, possibly based on https://github.com/elastic/stream and integrating synthetic data generation.
  2. Develop and publish an agent load testing integration. Allow local testing of the load generator input using the elastic-package tool (see https://github.com/elastic/integrations/blob/main/CONTRIBUTING.md).
  3. Allow running performance tests locally, and collecting test results into a report document that can be ingested into Elasticsearch and tracked over time. Use the APM benchmark output format as a reference ("Benchmark 2.0 production ready", apm-server#7540).
  4. Update the existing agent metrics dashboard to include all relevant performance metrics if they are not already present.
  5. Automate running performance tests on a daily basis. The key to integrating performance testing into CI will be creating repeatable hardware conditions, something several teams in Elastic have already solved.
  6. Allow running performance tests on a PR basis, possibly triggered by a dedicated label or as part of the existing E2E test suite.
