Liveness agent state#9673

Merged
nkvoll merged 6 commits into elastic:main from nkvoll:liveness-agent-state
Sep 15, 2025

Conversation

@nkvoll
Contributor

@nkvoll nkvoll commented Sep 1, 2025

What does this PR do?

This PR adds the aggregated status of the agent node to the liveness health check.

As a bonus, it also adds status code assertions to the tests, which were missing before (all liveness/readiness tests were passing without asserting anything).

Why is it important?

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

Liveness probes will now fail if the configuration is invalid, likely causing the container to be restarted (see https://kubernetes.io/docs/concepts/configuration/liveness-readiness-startup-probes/#liveness-probe).

How to test this PR locally

  1. Create an elastic-agent.yml file with an invalid output reference, for example by setting use_output: nonexistent on an input.
  2. Start elastic-agent with the relevant monitoring endpoints enabled.
  3. Verify with elastic-agent status that the agent reports FAILED:
┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (FAILED) Invalid component model: failed to render components: invalid 'inputs.0.use_output', references an unknown output 'nonexistent'
   └─ info
      ├─ id: e1a1e08b-9b0c-4394-a024-d35b823d415b
      ├─ version: 9.2.0
      └─ commit: ff80471809aca1f2280ce55f0e24f85cefec5d55
  4. Liveness probes should fail for failon=degraded and failon=failed, while failon=heartbeat still succeeds:
$ curl -w 'HTTP %{http_code}\n' 'http://localhost:6792/liveness?failon=degraded'
HTTP 500
$ curl -w 'HTTP %{http_code}\n' 'http://localhost:6792/liveness?failon=failed'
HTTP 500
$ curl -w 'HTTP %{http_code}\n' 'http://localhost:6792/liveness?failon=heartbeat'
HTTP 200
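
The curl checks above can also be scripted. The following is a sketch under assumptions: probe and stubAgent are hypothetical helpers, and the stub server merely replays the responses shown above so the example runs without an agent. To check a real agent, pass its monitoring address (http://localhost:6792 in the steps above) to probe instead of the stub's URL.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// probe issues GET <baseURL>/liveness?failon=<failon> and returns the status code.
func probe(baseURL, failon string) (int, error) {
	resp, err := http.Get(baseURL + "/liveness?failon=" + failon)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// stubAgent is a stand-in that mimics the responses of an agent whose
// configuration is invalid; it is not the agent's implementation.
func stubAgent() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		switch r.URL.Query().Get("failon") {
		case "degraded", "failed":
			w.WriteHeader(http.StatusInternalServerError)
		default:
			w.WriteHeader(http.StatusOK)
		}
	}))
}

func main() {
	srv := stubAgent()
	defer srv.Close()

	// Expected codes for an agent in the FAILED state, taken from the steps above.
	want := map[string]int{"degraded": 500, "failed": 500, "heartbeat": 200}
	for _, failon := range []string{"degraded", "failed", "heartbeat"} {
		got, err := probe(srv.URL, failon)
		if err != nil {
			fmt.Println("probe error:", err)
			return
		}
		fmt.Printf("failon=%s: got HTTP %d, want HTTP %d\n", failon, got, want[failon])
	}
}
```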

Related issues

@nkvoll nkvoll requested a review from a team as a code owner September 1, 2025 13:44
@mergify mergify bot assigned nkvoll Sep 1, 2025
@mergify
Contributor

mergify bot commented Sep 1, 2025

This pull request does not have a backport label. Could you fix it @nkvoll? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-\d.\d is the label that automatically backports to the 8.\d branch, where \d is a digit.
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

Comment thread internal/pkg/agent/application/monitoring/liveness.go
@nkvoll
Contributor Author

nkvoll commented Sep 1, 2025

From my testing, if the agent starts up in this state, it doesn't seem to start any components. If the configuration is edited while the agent is running, however, it keeps all existing components as-is.

This makes me wonder if what currently happens in the liveness endpoint should be happening in the readiness endpoint instead. Worth discussing? /cc @cmacknz @blakerouse

@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Sep 2, 2025
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@cmacknz
Member

cmacknz commented Sep 2, 2025

This makes me wonder if what currently happens in the liveness endpoint should be happening in the readiness endpoint instead. Worth discussing? /cc @cmacknz @blakerouse

I think this "Invalid component model" error would definitely make sense in the readiness endpoint. Agent isn't "ready to accept traffic" in this state.

I don't really see a reason why the readiness and liveness endpoints can't share the same implementation; the liveness endpoint as it is right now can just be optionally extended to detect more things.

ycombinator previously approved these changes Sep 3, 2025
Contributor

@ycombinator ycombinator left a comment

LGTM! Please add a CHANGELOG fragment by installing elastic-agent-changelog-tool, running elastic-agent-changelog-tool new from the root of the elastic-agent repo folder, and then editing the file that's generated. Thanks!

@nkvoll nkvoll added the backport-active-9 Automated backport with mergify to all the active 9.[0-9]+ branches label Sep 10, 2025
@ycombinator
Contributor

nkvoll added the backport-active-9 label

Since this is a bug fix, should we instead add the backport-active-all label?

@nkvoll nkvoll added backport-active-all Automated backport with mergify to all the active branches and removed backport-active-9 Automated backport with mergify to all the active 9.[0-9]+ branches labels Sep 15, 2025
@elasticmachine
Contributor

💚 Build Succeeded

History

cc @nkvoll

@nkvoll nkvoll merged commit d3b9427 into elastic:main Sep 15, 2025
23 checks passed
@github-actions
Contributor

@Mergifyio backport 8.18 8.19 9.0 9.1

@mergify
Contributor

mergify bot commented Sep 15, 2025

backport 8.18 8.19 9.0 9.1

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Sep 15, 2025
* fix(tests): update liveness/readiness test cases to assert status code and remove unused vars

* fix: correct order of fields in LivenessFailConfig for degraded state

* fix: remove unnecessary check for coordinator mode in liveness handler (already handled)

* fix: add unhealthy coordinator state handling in liveness handler

* add changelog fragment

(cherry picked from commit d3b9427)
@nkvoll nkvoll deleted the liveness-agent-state branch September 15, 2025 12:29
nkvoll added a commit that referenced this pull request Sep 15, 2025
* fix(tests): update liveness/readiness test cases to assert status code and remove unused vars

* fix: correct order of fields in LivenessFailConfig for degraded state

* fix: remove unnecessary check for coordinator mode in liveness handler (already handled)

* fix: add unhealthy coordinator state handling in liveness handler

* add changelog fragment

(cherry picked from commit d3b9427)

Co-authored-by: Njal Karevoll <njal@karevoll.no>
v1v added a commit that referenced this pull request Sep 16, 2025
* upstream: (26 commits)
  fix: ensure EDOT subprocess shuts down gracefully on agent termination (#9886)
  [main][Automation] Update versions (#9976)
  Add Collector reference docs and automation (#9953)
  [beatreceivers] Integrate beatsauthextension (#9257)
  [main][Automation] Update versions (#9941)
  Update OTel components to v0.132.0/v1.38.0 (#9954)
  Enhancement/5235 wrap errors when marking upgrade (#9366)
  Mount Go build cache into crossbuild container (#9094)
  Liveness agent state (#9673)
  [main][Automation] Bump VM Image version to 1757725254 (#9942)
  Enhancement/5235 correctly wrap errors from copyActionDir and copyRunDirectory (#9349)
  [main][Automation] Update elastic/beats to afc53c0479ac (#9874)
  Add -coverpkg option when running unit test to calculate coverage across packages (#9913)
  Cache binaries downloaded for packaging locally (#9133)
  [main][Automation] Update versions (#9897)
  Disable flaky test TestBeatsReceiverLogs (#9891)
  Allow overriding AGENT_PACKAGE_VERSION and MANIFEST_URL when USE_PACKAGE_VERSION=true (#9864)
  add ingest-docs team as CODEOWNERS for release notes and docset.yml (#9865)
  fix: correct spelling of 'output' in various templates and monitoring code (#9827)
  k8s: Add comment around hostUsers for Universal Profiling deployments (#9847)
  ...
intxgo pushed a commit to intxgo/elastic-agent that referenced this pull request Sep 24, 2025
* fix(tests): update liveness/readiness test cases to assert status code and remove unused vars

* fix: correct order of fields in LivenessFailConfig for degraded state

* fix: remove unnecessary check for coordinator mode in liveness handler (already handled)

* fix: add unhealthy coordinator state handling in liveness handler

* add changelog fragment
Labels

backport-active-all Automated backport with mergify to all the active branches Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Liveness endpoint does not consider overall agent state, only component state

5 participants