This repository was archived by the owner on Sep 21, 2023. It is now read-only.

[Meta] Implement shipper performance testing #57

@cmacknz

Description

The Elastic agent data shipper is actively under development and we need a way to benchmark its performance as part of the agent system. Specifically we are interested in benchmarking the achievable throughput of a single agent using the shipper along with its CPU, memory, and disk IOPS overhead. Users care about the performance of the agent and we need a way to measure and improve it.

Design

The proposed solution is to develop a new load generating input for the agent, which can be installed and configured as a standard agent integration. The test scenario can be changed by modifying the integration configuration or agent policy. Metrics will be collected using the existing agent monitoring features. Where the existing agent monitoring is not adequate, it should be enhanced so that all data necessary to diagnose performance issues is also available in the field. For example, all performance data should be available in the existing agent metrics dashboard.

[Diagram: Data Shipper Performance Testing architecture]

The new load generating input should be developed as one of the first non-beat inputs in the V2 agent input architecture. The load generator should be packaged into an agent load testing integration developed using the existing Elastic package tooling. Any agent can then be load tested by installing the necessary integration.

Automated deployment and provisioning can ideally reuse the same tools used to provision Fleet managed agents for end-to-end testing, with minimal extra work. When testing Elasticsearch, the instance used for Fleet and monitoring data should ideally be separate from the instance receiving data from the shipper, to avoid introducing instability into Fleet itself during stress tests.

The performance metrics resulting from each test can be queried out of the agent monitoring indices at the conclusion of each test. Profiles can be periodically collected via agent diagnostics or the /debug/pprof endpoint of the shipper.

The initial version of the agent load testing package will implement only a shipper client, which it will use to write simulated or pre-recorded events at a configurable rate. Multiple tools exist that could be integrated into the load generator input to generate data on demand: stream, the integration corpus generator, spigot, or flog.
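The pacing loop at the heart of such an input might look like the following sketch. The `emit` callback stands in for handing an event to the shipper client; the real input would use the shipper's own API rather than this hypothetical signature:

```go
package main

import (
	"fmt"
	"time"
)

// generateEvents emits count synthetic events at roughly perSecond
// events per second, invoking emit for each one. This is only a sketch
// of the rate-control loop; the actual input would pass events to the
// shipper client and handle backpressure from it.
func generateEvents(count, perSecond int, emit func(i int)) {
	interval := time.Second / time.Duration(perSecond)
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for i := 0; i < count; i++ {
		<-ticker.C // wait for the next send slot
		emit(i)
	}
}

func main() {
	sent := 0
	start := time.Now()
	// 100 events at a nominal 1000 events/second.
	generateEvents(100, 1000, func(i int) { sent++ })
	fmt.Printf("sent %d events in %s\n", sent, time.Since(start).Round(time.Millisecond))
}
```

A ticker-based loop keeps the configured rate independent of how fast individual events are produced, which is what makes throughput tests repeatable.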

Future versions of the load testing package can be developed with the load generator input configured to act as the data source for other inputs to pull from. For example a filebeat instance could be started and configured to consume data from the load generator using the syslog protocol, enabling tests of the entire agent ingestion system. Stream is already used to test integrations with elastic-package today and could serve as the starting point for this functionality.

Implementation Plan

TBD. Insert a development plan with linked issues, including at least the following high level tasks:

  1. Develop a load generator agent input, possibly based on https://github.com/elastic/stream and integrating synthetic data generation.
  2. Develop and publish an agent load testing integration. Allow local testing of the load generator input using the elastic-package tool (see https://github.com/elastic/integrations/blob/main/CONTRIBUTING.md).
  3. Allow running performance tests locally, and collecting test results into a report document that can be ingested into Elasticsearch and tracked over time. Use the APM benchmark output format as a reference ("Benchmark 2.0 production ready", apm-server#7540).
  4. Update the existing agent metrics dashboard to include all relevant performance metrics if they are not already present.
  5. Automate running performance tests on a daily basis. The key to integrating performance testing into CI will be creating repeatable hardware conditions, something several teams in Elastic have already solved.
  6. Allow running performance tests on a PR basis, possibly triggered by a dedicated label or as part of the existing E2E test suite.
