fix: ensure EDOT subprocess shuts down gracefully on agent termination #9886

Merged
pkoutsovasilis merged 6 commits into elastic:main from pkoutsovasilis:fix/gracefully_shutdown_edot_if_agent_shutdown
Sep 16, 2025
Conversation

Contributor

@pkoutsovasilis pkoutsovasilis commented Sep 11, 2025

What does this PR do?

Ensures Elastic Agent gracefully shuts down the EDOT (collector) subprocess when the agent receives a terminating signal such as SIGTERM (that is, when the OTelManager context is cancelled) instead of killing it immediately.

Key changes:

  • Replace collectorHandle.Stop(ctx context.Context) with Stop(waitTime time.Duration) and honor it in both embedded and subprocess execution modes.
  • Set waitTimeForStop to 30s (aligned with Beats subprocess defaults).
  • Add warnings when the supervised collector fails to stop gracefully or times out.
  • Make the updateCh channel buffered with a size of 1 and drain it before each write (a pattern used elsewhere) to avoid shutdown and reconfiguration delays.
  • Unit tests:
    • Extend test harness to allow configurable shutdown delays via TEST_SUPERVISED_COLLECTOR_DELAY.
    • Validate both outcomes:
      • Delay > 30s ⇒ subprocess is force-killed.
      • Delay < 30s ⇒ subprocess exits cleanly within the timeout.
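
The buffered updateCh change listed above follows a common Go "drain before write" pattern. The following is only a minimal sketch of that pattern, not the code from this PR: the helper name `sendLatest` and the string payload are hypothetical, and the sketch assumes a single writer.

```go
package main

import "fmt"

// sendLatest publishes v on a channel of capacity 1 without blocking.
// If an older value is still pending, it is discarded first, so the reader
// only ever sees the most recent update. Assumes a single writer goroutine.
// (Hypothetical helper for illustration; not taken from the PR.)
func sendLatest[T any](ch chan T, v T) {
	select {
	case <-ch: // drop the stale value, if one is buffered
	default: // buffer already empty
	}
	ch <- v // the single buffer slot is free, so this does not block
}

func main() {
	updateCh := make(chan string, 1) // stand-in for the manager's update channel

	sendLatest(updateCh, "config-v1")
	sendLatest(updateCh, "config-v2") // replaces v1 instead of waiting for a reader

	fmt.Println(<-updateCh) // prints "config-v2"
}
```

With this shape the writer never waits on a slow or already-stopped reader, which is the kind of shutdown or reconfiguration delay the bullet refers to.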

Why is it important?

Previously, on shutdown the agent could terminate in a way that kills EDOT immediately, risking incomplete cleanup of telemetry pipelines and leaving resources in a bad state. Waiting (up to 30s) for a graceful EDOT exit improves cleanup, stability, and hybrid agent resilience.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

No disruptive impact expected. The only behavior change is during shutdown: the agent now waits up to 30s for EDOT to exit gracefully before force-killing it. This improves stability and predictability for users.
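
The wait-then-force-kill behavior described here can be sketched as a `Stop(waitTime time.Duration)` method that requests shutdown and then waits on a bounded timer. This is an illustrative sketch only: the `handle` type, `start` helper, and the `short`/`long` durations are hypothetical, and a goroutine stands in for the collector. For a real subprocess, the timeout branch is where a force-kill (for example `os.Process.Kill`) would typically go.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Illustrative durations; the PR description uses a 30s wait.
const (
	short = 10 * time.Millisecond
	long  = 2 * time.Second
)

// handle is a hypothetical stand-in for a collector handle.
type handle struct {
	cancel context.CancelFunc // asks the worker to shut down
	done   chan struct{}      // closed once the worker has exited
}

// start launches a worker that needs `cleanup` time to exit after cancellation.
func start(cleanup time.Duration) *handle {
	ctx, cancel := context.WithCancel(context.Background())
	h := &handle{cancel: cancel, done: make(chan struct{})}
	go func() {
		defer close(h.done)
		<-ctx.Done()
		time.Sleep(cleanup) // simulated cleanup work
	}()
	return h
}

// Stop requests shutdown and waits up to waitTime for the worker to exit.
// It reports true on a graceful exit and false on timeout, in which case
// the caller would fall back to force-killing the worker.
func (h *handle) Stop(waitTime time.Duration) bool {
	h.cancel()
	timer := time.NewTimer(waitTime)
	defer timer.Stop()
	select {
	case <-h.done:
		return true
	case <-timer.C:
		return false
	}
}

func main() {
	fmt.Println("cleanup shorter than wait, graceful:", start(short).Stop(long))
	fmt.Println("cleanup longer than wait, graceful:", start(long).Stop(short))
}
```

Passing a duration rather than a context keeps the wait bounded even when the caller's own context has already been cancelled, which is the situation during agent termination.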

How to test this PR locally

mage unitTest
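
The unit tests run by this command rely on a test binary whose shutdown delay is configured through TEST_SUPERVISED_COLLECTOR_DELAY. A minimal sketch of how such a delay could be read is shown below. It assumes the value is a Go duration string such as "45s"; the actual harness may use a different format, and `parseDelay` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// delayEnv is the environment variable named in the PR description.
const delayEnv = "TEST_SUPERVISED_COLLECTOR_DELAY"

// parseDelay converts a raw value into a shutdown delay.
// Assumption: the value is a Go duration string (e.g. "45s").
// Empty, malformed, or negative values yield no delay.
func parseDelay(raw string) time.Duration {
	d, err := time.ParseDuration(raw)
	if err != nil || d < 0 {
		return 0
	}
	return d
}

func main() {
	delay := parseDelay(os.Getenv(delayEnv))
	fmt.Printf("simulating shutdown work for %s\n", delay)
	time.Sleep(delay) // stand-in for a collector that is slow to exit
}
```

In a Go test, `t.Setenv` from the standard `testing` package can set such a variable for the duration of a single test and restore it afterwards.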

Related issues

N/A

@pkoutsovasilis pkoutsovasilis self-assigned this Sep 11, 2025
@pkoutsovasilis pkoutsovasilis added the Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team), skip-changelog, and backport-8.19 (Automated backport to the 8.19 branch) labels Sep 11, 2025
@pkoutsovasilis pkoutsovasilis force-pushed the fix/gracefully_shutdown_edot_if_agent_shutdown branch from 0194d5f to e114ea5 Compare September 11, 2025 13:27
@pkoutsovasilis pkoutsovasilis force-pushed the fix/gracefully_shutdown_edot_if_agent_shutdown branch from e114ea5 to 455fd59 Compare September 11, 2025 15:32
@pkoutsovasilis pkoutsovasilis marked this pull request as ready for review September 11, 2025 20:50
@pkoutsovasilis pkoutsovasilis requested a review from a team as a code owner September 11, 2025 20:50
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

Member

@pchila pchila left a comment

A couple of nitpicks and a question on the select statement

Comment thread internal/pkg/otel/manager/manager.go Outdated
Comment thread internal/pkg/otel/manager/manager.go
Comment thread internal/pkg/otel/manager/testing/testing.go
pchila previously approved these changes Sep 12, 2025
Member

@pchila pchila left a comment

LGTM

swiatekm previously approved these changes Sep 12, 2025
Member

@swiatekm swiatekm left a comment

LGTM, some minor nitpicks and a non-blocking question.

Comment thread internal/pkg/otel/manager/testing/testing.go Outdated
Comment thread internal/pkg/otel/manager/manager.go
Comment thread internal/pkg/otel/manager/manager_test.go Outdated
Comment thread internal/pkg/otel/manager/manager_test.go Outdated
Comment thread internal/pkg/otel/manager/manager.go Outdated
@pkoutsovasilis pkoutsovasilis dismissed stale reviews from swiatekm and pchila via c997949 September 15, 2025 08:34
@elasticmachine
Contributor

💛 Build succeeded, but was flaky

cc @pkoutsovasilis

@cmacknz
Member

cmacknz commented Sep 15, 2025

LGTM, @swiatekm can do the approving here since he still has unresolved comments.

@pkoutsovasilis pkoutsovasilis merged commit 3e1fc2b into elastic:main Sep 16, 2025
23 checks passed
mergify bot pushed a commit that referenced this pull request Sep 16, 2025
#9886)

* fix: ensure EDOT subprocess shuts down gracefully on agent termination

* fix: reword returned error

* doc: add comment to describe the functionality of testing.go

* fix: re-structure shutdown delay in test binary

* fix: utilise t.SetEnv in unit-tests

* feat: derive otel manager wait to stop timeout from agent.process.stop_timeout

(cherry picked from commit 3e1fc2b)
pkoutsovasilis added a commit that referenced this pull request Sep 16, 2025
#9886) (#9986)

* fix: ensure EDOT subprocess shuts down gracefully on agent termination

* fix: reword returned error

* doc: add comment to describe the functionality of testing.go

* fix: re-structure shutdown delay in test binary

* fix: utilise t.SetEnv in unit-tests

* feat: derive otel manager wait to stop timeout from agent.process.stop_timeout

(cherry picked from commit 3e1fc2b)

Co-authored-by: Panos Koutsovasilis <panos.koutsovasilis@elastic.co>
v1v added a commit that referenced this pull request Sep 16, 2025
* upstream: (26 commits)
  fix: ensure EDOT subprocess shuts down gracefully on agent termination (#9886)
  [main][Automation] Update versions (#9976)
  Add Collector reference docs and automation (#9953)
  [beatreceivers] Integrate beatsauthextension (#9257)
  [main][Automation] Update versions (#9941)
  Update OTel components to v0.132.0/v1.38.0 (#9954)
  Enhancement/5235 wrap errors when marking upgrade (#9366)
  Mount Go build cache into crossbuild container (#9094)
  Liveness agent state (#9673)
  [main][Automation] Bump VM Image version to 1757725254 (#9942)
  Enhancement/5235 correctly wrap errors from copyActionDir and copyRunDirectory (#9349)
  [main][Automation] Update elastic/beats to afc53c0479ac (#9874)
  Add -coverpkg option when running unit test to calculate coverage across packages (#9913)
  Cache binaries downloaded for packaging locally (#9133)
  [main][Automation] Update versions (#9897)
  Disable flaky test TestBeatsReceiverLogs (#9891)
  Allow overriding AGENT_PACKAGE_VERSION and MANIFEST_URL when USE_PACKAGE_VERSION=true (#9864)
  add ingest-docs team as CODEOWNERS for release notes and docset.yml (#9865)
  fix: correct spelling of 'output' in various templates and monitoring code (#9827)
  k8s: Add comment around hostUsers for Universal Profiling deployments (#9847)
  ...
intxgo pushed a commit to intxgo/elastic-agent that referenced this pull request Sep 24, 2025
elastic#9886)

* fix: ensure EDOT subprocess shuts down gracefully on agent termination

* fix: reword returned error

* doc: add comment to describe the functionality of testing.go

* fix: re-structure shutdown delay in test binary

* fix: utilise t.SetEnv in unit-tests

* feat: derive otel manager wait to stop timeout from agent.process.stop_timeout
Labels

backport-8.19 (Automated backport to the 8.19 branch), skip-changelog, Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team)

5 participants