This repository was archived by the owner on Sep 21, 2023. It is now read-only.

Correctly handle enqueued events affected by agent policy changes #49

@cmacknz

Description

We need to think through all the edge cases that can arise when events in the shipper queue are affected by an agent policy change. For a concrete example, consider the case where a user removes an integration but events collected by that integration still reside in the shipper queue:

  1. User creates Agent policy rev. 1 containing integration A and integration B.
  2. Fleet Server generates an API key with append permissions to write to the data streams for integrations A and B.
  3. Elastic Agent receives and runs Agent policy rev. 1.
  4. Elastic Agent needs to persist events to disk (events from integrations A and B are persisted on disk).
  5. User removes integration B; the Agent policy is updated to rev. 2.
  6. Fleet Server generates a new API key with append permission to write to the data stream for integration A only.
  7. Elastic Agent receives and runs Agent policy rev. 2.
  8. Elastic Agent acknowledges the configuration.
  9. Fleet Server invalidates the original Elasticsearch API key.

In the case above, the events for the removed integration B can never be ingested by Elasticsearch once the original API key has been invalidated. This sequence is worse with the disk queue because the number of enqueued events can be much larger, but it applies to the memory queue as well.

We must also consider that not every policy change causes this problem. For example, changing the number of output workers does not affect events already in the queue.

For policy changes that do affect enqueued events, there are several paths forward we could take to solve this problem:

  1. Decide that it is safe to drop events for integration B, and have a mechanism to do so reliably when the API key changes. This option is complicated by the fact that the shipper pipeline is unaware of agent policy changes, and by the ability to configure infinite retries for failed events.
  2. Ensure all affected events have been successfully sent and removed from the queue before acknowledging the policy change. In the V2 agent control protocol, the agent could send the shipper an expected state of stopped, which the shipper takes as a signal to flush all events. The agent does not consider the unit done, and does not roll out the new policy, until that unit reports back an observed state of stopped. So as soon as the shipper receives stopped as the expected state, it reports stopping (i.e. starting the flush), then stopped (i.e. completely flushed).
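The stopping/stopped handshake in option 2 could look roughly like the following Go sketch. The state names and function signatures here are illustrative only, not the actual V2 control protocol types; the point is that the shipper never reports the terminal state until the flush has actually completed:

```go
package main

import "fmt"

// UnitState models the observed states the shipper reports back to the
// agent. These constants are illustrative, not the real protocol enum.
type UnitState int

const (
	StateHealthy UnitState = iota
	StateStopping
	StateStopped
)

// onExpectedStopped sketches the shipper's reaction when the agent sets the
// unit's expected state to stopped: immediately report stopping, flush the
// queue, and only report stopped once the flush succeeds. The agent will not
// acknowledge the policy change until it observes stopped.
func onExpectedStopped(flush func() error, report func(UnitState)) error {
	report(StateStopping) // flush in progress
	if err := flush(); err != nil {
		// Stay in stopping; the agent keeps waiting rather than
		// rolling out a policy that would strand enqueued events.
		return err
	}
	report(StateStopped) // queue fully drained; policy change may be acked
	return nil
}

func main() {
	var observed []UnitState
	flush := func() error { return nil } // pretend the queue drained cleanly
	onExpectedStopped(flush, func(s UnitState) {
		observed = append(observed, s)
	})
	fmt.Println(observed) // prints [1 2], i.e. stopping then stopped
}
```

If the flush fails (or never finishes), the unit simply never reports stopped, which is what prevents the agent from acknowledging rev. 2 prematurely.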

Option 2 avoids data loss, but is the most complex path forward. There are multiple ways we could ensure all events affected by a policy change are drained from the shipper queue before acknowledging the policy change:

  1. Have the agent provision a second instance of the shipper process, with new events routed to the new instance. The policy change is considered acknowledged when the original shipper exits successfully after flushing all events to the output. The system would need to handle the case where the first shipper never exits successfully. The primary downside of this solution is that it temporarily doubles the number of queues and connections made to the output.
  2. Have the shipper internally provision a second instance of its data pipeline, with all new events routed to the new pipeline. This is the same as the first option but with the pipeline duplicated in a single shipper process. The number of connections can be kept constant but the number of queues is doubled.
  3. A policy change emits a special meta event into the pipeline. When this event is read at the output the shipper knows all affected events have been flushed through the queue and it acknowledges the policy change. This avoids duplicating the queues and connections.
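The marker approach in the third option relies on the queue being FIFO: once the output reads the marker, every event enqueued before the policy change must already have been sent. A minimal Go sketch, with the queue modelled as a channel and all names hypothetical:

```go
package main

import "fmt"

// event is either a normal data event or a policy-change marker injected
// into the pipeline when a new policy revision arrives. The type and field
// names are illustrative, not the shipper's actual event model.
type event struct {
	data      string
	isMarker  bool
	policyRev int
}

// drain consumes events in order. Because the queue is FIFO, seeing the
// marker proves all earlier events have been handed to the output, so the
// marker's policy revision can be acknowledged. Returns 0 if no marker is
// ever read.
func drain(queue <-chan event, send func(string)) int {
	for ev := range queue {
		if ev.isMarker {
			return ev.policyRev
		}
		send(ev.data)
	}
	return 0
}

func main() {
	queue := make(chan event, 4)
	queue <- event{data: "integration-B-doc"} // enqueued under rev. 1
	queue <- event{data: "integration-A-doc"}
	queue <- event{isMarker: true, policyRev: 2} // emitted on policy change
	close(queue)

	var sent []string
	rev := drain(queue, func(d string) { sent = append(sent, d) })
	fmt.Println(sent, "ack rev", rev)
	// prints [integration-B-doc integration-A-doc] ack rev 2
}
```

Note this sketch sidesteps the retry problem from option 1: if events ahead of the marker can be retried forever, the marker (and therefore the acknowledgement) may never be reached, so retry limits still need to be considered.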

This is a complex issue with many possible solutions. Evaluate each of the proposed solutions (and consider new ones) to decide which path we should take to solve this issue. The outcome of this issue should be a meta issue with an implementation plan for solving this problem.
