Upgrade actions can be ignored due to stale entries in bkgActions #9629

@pkoutsovasilis

Background

Actions dequeued from the dispatcher action store and dispatched to the coordinator upgrader:

  1. are not retried on failure
  2. are not persisted across an agent restart (although fleet-server should presumably re-send them on the first check-in after restart; this is unconfirmed)

Note: this is probably worth tracking as a separate investigation issue.

Issue

If Elastic Defend is running and agent tamper protection is enabled, an upgrade action can remain stale in bkgActions. Specifically, the handler adds the action to bkgActions when it calls getAsyncContext, but if Elastic Defend can't acknowledge the upgrade action, the handler returns an error without removing the entry from bkgActions:

asyncCtx, runAsync := h.getAsyncContext(ctx, a, ack)
if !runAsync {
	return nil
}
if h.tamperProtectionFn() {
	// Find inputs that want to receive UPGRADE action
	// Endpoint needs to receive a signed UPGRADE action in order to be able to uncontain itself
	state := h.coord.State()
	ucs := findMatchingUnitsByActionType(state, a.Type())
	if len(ucs) > 0 {
		h.log.Debugf("handlerUpgrade: proxy/dispatch action '%+v'", a)
		err := notifyUnitsOfProxiedAction(ctx, h.log, action, ucs, h.coord.PerformAction)
		h.log.Debugf("handlerUpgrade: after action dispatched '%+v', err: %v", a, err)
		if err != nil {
			return err
		}
	} else {
		// Log and continue
		h.log.Debugf("No components running for %v action type", a.Type())
	}
}
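
The failure mode above, and one possible way to avoid it, can be sketched with a simplified standalone model. This is not the agent's actual code: the `handler` type, `upgradeKey`, `register`, `clear`, `notify`, and the `cleanupOnError` switch are all hypothetical stand-ins (`register` plays the role of getAsyncContext and `notify` the role of notifyUnitsOfProxiedAction), and the model assumes in-flight upgrades are deduplicated by version and source.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// upgradeKey is a hypothetical identity for an upgrade request. The issue
// states that repeats with the same version and source are ignored.
type upgradeKey struct {
	Version   string
	SourceURI string
}

// handler is a simplified, hypothetical model of the upgrade handler.
type handler struct {
	mu         sync.Mutex
	bkgActions map[upgradeKey]struct{}
	// notify stands in for notifyUnitsOfProxiedAction.
	notify func(upgradeKey) error
}

// register stands in for getAsyncContext: it records the action as in
// flight and reports false when an identical action is already tracked.
func (h *handler) register(k upgradeKey) bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	if _, tracked := h.bkgActions[k]; tracked {
		return false
	}
	h.bkgActions[k] = struct{}{}
	return true
}

// clear removes the in-flight entry for k.
func (h *handler) clear(k upgradeKey) {
	h.mu.Lock()
	defer h.mu.Unlock()
	delete(h.bkgActions, k)
}

// handle models the early part of the upgrade handler. With cleanupOnError
// set to false it reproduces the reported behavior; with true it removes
// the entry before returning the notification error.
func (h *handler) handle(k upgradeKey, cleanupOnError bool) (started bool, err error) {
	if !h.register(k) {
		return false, nil // duplicate of a tracked action: silently ignored
	}
	if err := h.notify(k); err != nil {
		if cleanupOnError {
			h.clear(k)
		}
		return false, err
	}
	// The real handler would start the background upgrade here.
	return true, nil
}

// retryStarts fails the first notification, then reports whether a second,
// identical upgrade request is allowed to start.
func retryStarts(cleanupOnError bool) bool {
	calls := 0
	h := &handler{
		bkgActions: map[upgradeKey]struct{}{},
		notify: func(upgradeKey) error {
			calls++
			if calls == 1 {
				return errors.New("endpoint did not acknowledge upgrade")
			}
			return nil
		},
	}
	k := upgradeKey{Version: "9.9.9", SourceURI: "https://example.invalid/artifacts"}
	_, _ = h.handle(k, cleanupOnError) // first attempt fails
	started, _ := h.handle(k, cleanupOnError)
	return started
}

func main() {
	fmt.Println("retry starts without cleanup:", retryStarts(false))
	fmt.Println("retry starts with cleanup:", retryStarts(true))
}
```

In this model the retry is dropped when the entry is left behind and proceeds once the error path clears it. Whether the real fix belongs in the error branch, in a deferred cleanup, or inside getAsyncContext itself is a design decision for the handler's maintainers.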

Impact

  • Upgrade actions may remain permanently stuck in bkgActions.
  • Subsequent upgrade attempts with the same version and source are ignored.
  • Likely the cause of multiple recent internal error reports.

For confirmed bugs, please report:

  • Version: All active releases
  • Operating System: All
