This repository was archived by the owner on Sep 21, 2023. It is now read-only.

The Elasticsearch output should not report itself as degraded based only on the time between events #301

@cmacknz

Description


The implementation in #174 was incompletely specified. We currently consider the shipper's Elasticsearch output degraded whenever it has not written events to Elasticsearch within the past 30 seconds: #239

This is an acceptable proxy for an inability to connect to Elasticsearch, but it does not account for low-volume log sources. Users could tune the timeout, but this isn't something they have traditionally had to do, and it may lead to false-positive degraded states.

Instead, we should only mark the shipper as degraded when we have not published events for 30 seconds and we have detected an explicit error attempting to connect to Elasticsearch. For example, this would include connection refused errors, failed DNS lookups, or invalid credentials.

The most common reasons for failing to connect to Elasticsearch would be incorrect proxy configurations, connectivity outages, or invalidated API keys. We should address these cases specifically instead of using a catch-all timeout that makes assumptions about the steady-state event rate.
