
[8.19] (backport #9886) fix: ensure EDOT subprocess shuts down gracefully on agent termination #9986

Merged
pkoutsovasilis merged 1 commit into 8.19 from mergify/bp/8.19/pr-9886
Sep 16, 2025

@mergify mergify bot commented Sep 16, 2025

What does this PR do?

Ensures Elastic Agent gracefully shuts down the EDOT (collector) subprocess when the agent receives a terminating signal (e.g., SIGTERM), that is, when the OTelManager context is cancelled, instead of immediately killing it.

Key changes:

  • Replace collectorHandle.Stop(ctx context.Context) with Stop(waitTime time.Duration) and honor it in both embedded and subprocess execution modes.
  • Set waitTimeForStop to 30s (aligned with Beats subprocess defaults).
  • Add warnings when the supervised collector fails to stop gracefully or times out.
  • Make updateCh channel buffered to a size of 1 and drain before write (pattern used elsewhere) to avoid shutdown/reconfig delays.
  • Unit-Tests:
    • Extend test harness to allow configurable shutdown delays via TEST_SUPERVISED_COLLECTOR_DELAY.
    • Validate both outcomes:
      • Delay > 30s ⇒ subprocess is force-killed.
      • Delay < 30s ⇒ subprocess exits cleanly within the timeout.
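
The "buffered to a size of 1 and drain before write" pattern from the updateCh bullet can be sketched as below. This is a minimal illustration, not the code from this PR: `sendLatest` is a hypothetical helper name, and the sketch assumes a single sender, which is what guarantees the final send never blocks.

```go
package main

import "fmt"

// sendLatest publishes v on a channel of capacity 1 without blocking,
// assuming there is only one sender. A stale, unread value is dropped
// first, so the receiver only ever observes the most recent update.
func sendLatest[T any](ch chan T, v T) {
	// Drain the pending value, if there is one.
	select {
	case <-ch:
	default:
	}
	// The single buffer slot is now free, so this send returns immediately.
	ch <- v
}

func main() {
	updateCh := make(chan string, 1)

	sendLatest(updateCh, "config-v1")
	sendLatest(updateCh, "config-v2") // replaces the unread config-v1

	fmt.Println(<-updateCh) // prints "config-v2"
}
```

Because the writer never waits for the reader, a slow or already-stopped consumer cannot hold up a reconfiguration or a shutdown.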

Why is it important?

Previously, on shutdown the agent could terminate in a way that kills EDOT immediately, risking incomplete cleanup of telemetry pipelines and leaving resources in a bad state. Waiting (up to 30s) for a graceful EDOT exit improves cleanup, stability, and hybrid agent resilience.
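
The wait-then-kill behaviour described here can be sketched roughly as follows. This is an illustration under stated assumptions rather than the PR's implementation: the `stopper` interface, `stopWithWait`, `errForceKilled`, `fakeProc`, and `ms` are all hypothetical names, and a fake in-memory process stands in for the real EDOT subprocess.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// stopper is a hypothetical abstraction over a supervised process.
type stopper interface {
	Interrupt() error      // request a graceful exit (e.g. SIGTERM)
	Kill() error           // terminate immediately
	Done() <-chan struct{} // closed once the process has exited
}

// errForceKilled reports that the grace period elapsed and the process was killed.
var errForceKilled = errors.New("process did not stop in time and was killed")

// stopWithWait asks s to exit and waits up to waitTime before killing it.
func stopWithWait(s stopper, waitTime time.Duration) error {
	if err := s.Interrupt(); err != nil {
		// The polite request failed, so killing is the only option left.
		return errors.Join(err, s.Kill())
	}
	timer := time.NewTimer(waitTime)
	defer timer.Stop()
	select {
	case <-s.Done():
		return nil // exited cleanly within the grace period
	case <-timer.C:
		if err := s.Kill(); err != nil {
			return err
		}
		return errForceKilled
	}
}

// fakeProc simulates a process that needs shutdownDelay to exit after Interrupt.
type fakeProc struct {
	shutdownDelay time.Duration
	done          chan struct{}
	once          sync.Once
}

func newFakeProc(shutdownDelay time.Duration) *fakeProc {
	return &fakeProc{shutdownDelay: shutdownDelay, done: make(chan struct{})}
}

func (p *fakeProc) finish() { p.once.Do(func() { close(p.done) }) }
func (p *fakeProc) Interrupt() error {
	go func() { time.Sleep(p.shutdownDelay); p.finish() }()
	return nil
}
func (p *fakeProc) Kill() error           { p.finish(); return nil }
func (p *fakeProc) Done() <-chan struct{} { return p.done }

// ms converts milliseconds to a time.Duration to keep the calls below short.
func ms(n int) time.Duration { return time.Duration(n) * time.Millisecond }

func main() {
	// Shutdown delay shorter than the wait time: clean exit, nil error.
	fmt.Println("fast:", stopWithWait(newFakeProc(ms(10)), ms(200)))
	// Shutdown delay longer than the wait time: force-killed.
	fmt.Println("slow:", stopWithWait(newFakeProc(ms(1000)), ms(50)))
}
```

The two calls in `main` mirror the two outcomes the unit tests validate, with milliseconds standing in for the 30s timeout.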

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

No disruptive impact expected. The only behavior change is during shutdown: the agent now waits up to 30s for EDOT to exit gracefully before force-killing it. This improves stability and predictability for users.

How to test this PR locally

mage unitTest
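
The unit tests run by this command rely on a test binary whose shutdown can be delayed through TEST_SUPERVISED_COLLECTOR_DELAY. A rough sketch of how such a binary might read that variable is shown below. The value format is not stated in this PR, so the sketch assumes a Go duration string such as `45s`; `shutdownDelay` and `delayEnv` are hypothetical names.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// delayEnv is the variable named in the PR description.
const delayEnv = "TEST_SUPERVISED_COLLECTOR_DELAY"

// shutdownDelay returns how long the test binary should linger before
// exiting. The lookup function is injected so the parsing can be tested
// without touching the real environment. Assumption: the value is a Go
// duration string; an unset or empty variable means no delay.
func shutdownDelay(lookup func(string) (string, bool)) (time.Duration, error) {
	raw, ok := lookup(delayEnv)
	if !ok || raw == "" {
		return 0, nil
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return 0, fmt.Errorf("invalid %s=%q: %w", delayEnv, raw, err)
	}
	if d < 0 {
		return 0, fmt.Errorf("invalid %s=%q: delay must not be negative", delayEnv, raw)
	}
	return d, nil
}

func main() {
	d, err := shutdownDelay(os.LookupEnv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// A real test binary would block here until it receives a termination
	// signal; this sketch goes straight to the delayed exit.
	time.Sleep(d)
	fmt.Printf("exiting after %s shutdown delay\n", d)
}
```

Setting the delay above or below the agent's wait time is what lets a test choose between the force-kill and clean-exit outcomes.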

Related issues

N/A


This is an automatic backport of pull request #9886 done by Mergify.

#9886)

* fix: ensure EDOT subprocess shuts down gracefully on agent termination

* fix: reword returned error

* doc: add comment to describe the functionality of testing.go

* fix: re-structure shutdown delay in test binary

* fix: utilise t.SetEnv in unit-tests

* feat: derive otel manager wait to stop timeout from agent.process.stop_timeout

(cherry picked from commit 3e1fc2b)
@mergify mergify bot added the backport label Sep 16, 2025
@mergify mergify bot requested a review from a team as a code owner September 16, 2025 14:06
@mergify mergify bot requested review from kaanyalti and ycombinator and removed request for a team September 16, 2025 14:06
@github-actions github-actions bot added the Team:Elastic-Agent-Control-Plane and skip-changelog labels Sep 16, 2025
@elasticmachine

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@elasticmachine

💛 Build succeeded, but was flaky

Failed CI Steps

cc @pkoutsovasilis

@pkoutsovasilis pkoutsovasilis merged commit ea7e054 into 8.19 Sep 16, 2025
21 of 22 checks passed
@pkoutsovasilis pkoutsovasilis deleted the mergify/bp/8.19/pr-9886 branch September 16, 2025 16:21