Fix liveness reload config handling #4586
Conversation
---
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
---
This pull request does not have a backport label. Could you fix it @fearful-symmetry? 🙏
NOTE: The instructions in https://github.com/elastic/elastic-agent?tab=readme-ov-file#testing-on-elastic-cloud should let you create a cloud deployment that uses this version of the agent. You can use that to confirm the integrations server comes up as healthy.
---
I tested this locally and it doesn't seem to work. I installed a standalone agent with the HTTP server enabled in elastic-agent.yml:

```yaml
agent.monitoring:
  enabled: true
  http:
    enabled: true
    port: 6791
```

```shell
❯ sudo ./elastic-agent install -f
❯ curl localhost:6791/processes
{"processes":[{"id":"system/metrics-default","binary":"metricbeat","source":{"kind":"configured","outputs":["default"]}},{"id":"filestream-monitoring","binary":"filebeat","source":{"kind":"internal","outputs":["monitoring"]}},{"id":"beat/metrics-monitoring","binary":"metricbeat","source":{"kind":"internal","outputs":["monitoring"]}},{"id":"http/metrics-monitoring","binary":"metricbeat","source":{"kind":"internal","outputs":["monitoring"]}}]}
```

Then I enrolled it in Fleet with the default policy:

```shell
sudo elastic-agent enroll --url=https://1cdab201df9f4dbb936887167f9d4aa2.fleet.eastus2.staging.azure.foundit.no:443 --enrollment-token=XXXX
```

And I can't hit the processes endpoint anymore:

```shell
❯ curl localhost:6791/liveness
404 page not found
```

The pre-config.yaml has:

```yaml
agent:
  download:
    sourceURI: https://artifacts.elastic.co/downloads/
  features: null
  monitoring:
    enabled: true
    logs: true
    metrics: true
    namespace: default
    use_output: default
```

The local-config.yaml has:

```yaml
monitoring:
  diagnostics:
    limit:
      burst: 1
      interval: 1m0s
    uploader:
      initdur: 1s
      maxdur: 10m0s
      maxretries: 10
  enabled: true
  http:
    buffer: null
    enabled: true
    host: localhost
    port: 6791
```
---
I'm still trying to figure out what's going on here. @cmacknz it looks like you're using staging.found.no for the fleet server, but the 8.14 snapshot build seems broken, probably as fallout from this PR. Did you just get lucky with the cloud or am I being dumb?
---
Interestingly, I can't reproduce this with the local file config, despite the fact that it should be the same reloading mechanism passing the same config. I wonder if there's something weird going on with the config merging.
---
Alright, it's getting late. Hopefully tomorrow the snapshots will be updated and I'll have a few more options for reproducing this.
I created my cloud deployment before the problem was introduced. Using an 8.13.3-SNAPSHOT deployment or a stack from elastic-package are ways around this if the 8.14.0-SNAPSHOT isn't fixed tomorrow. |
---
Alright, I can FINALLY reproduce this.
---
Alright, the behavior of the agent with @cmacknz's example (
That's... a slightly more complex problem. It's not "we need to care about state between reloads", it's "the agent must know what happened before it started".
---
Now that I remember more: on enroll we actually replace the current agent configuration with a blank one before getting it from Fleet. This happens in a non-obvious place here: elastic-agent/internal/pkg/agent/cmd/enroll.go lines 497 to 502 in eca5bc7. The starting configuration is here: elastic-agent/_meta/elastic-agent.fleet.yml lines 1 to 7 in eca5bc7. The agent is restarted when it is enrolled as well.
---
The thing is, this reset of the configuration happened even without your reload changes, so I'm now confused how monitoring remained enabled in cloud to begin with.
---
So, in the code currently in main, the HTTP config gets set remarkably early, in elastic-agent/internal/pkg/agent/cmd/run.go line 387 in eca5bc7. This eventually loads a bunch of the state from the encrypted state store. This all happens before Fleet init, I think. I assume what's happening is that when we install, the config in the
---
I'm gonna see if I can get the HTTP handler to consider the "early" overrides config.
---
@fearful-symmetry Maybe you can test this PR with https://github.com/elastic/elastic-agent/blob/main/testing/environments/cloud/README.md ? |
---
Alright @cmacknz, this should work. However, this raises a few more implementation questions, since it means we need to actually care about not just the Fleet config, but also the config at init time. Do we try to merge them somehow? Just care about the
Going to keep poking at tests.
The version of this change that is not breaking for anybody is to work exactly the way it did before, except now it respects changed configurations from Fleet and standalone configuration reloads. That is, we want the init-time behavior to be unchanged; only what happens after that needs to change.
---
Alright, so, this should be a relatively reasonable default. If there's no fleet config set, just revert to the original overrides settings. |
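That default can be sketched as a small selection function. This is a minimal illustration, not the agent's actual code — `HTTPConfig` and `effectiveHTTP` are hypothetical names, assuming the reload hands us either a monitoring config or nothing at all:

```go
package main

import "fmt"

// HTTPConfig is a simplified stand-in for the agent's monitoring HTTP
// settings; the real struct lives in the elastic-agent config packages.
type HTTPConfig struct {
	Enabled bool
	Host    string
	Port    int
}

// effectiveHTTP picks the config a reload should apply: if the incoming
// (e.g. Fleet) config carries no monitoring section at all, fall back to
// the overrides captured at init time instead of silently disabling the
// server. An explicit incoming config always wins.
func effectiveHTTP(initOverrides, reloaded *HTTPConfig) *HTTPConfig {
	if reloaded == nil {
		return initOverrides
	}
	return reloaded
}

func main() {
	overrides := &HTTPConfig{Enabled: true, Host: "localhost", Port: 6791}
	// A Fleet policy that never mentions monitoring.http arrives as nil here,
	// so the server from the original overrides stays up.
	fmt.Println(effectiveHTTP(overrides, nil).Enabled)
	// An explicit disable still takes effect.
	fmt.Println(effectiveHTTP(overrides, &HTTPConfig{Enabled: false}).Enabled)
}
```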
---
Alright, well, I've tested this, but I'm having some permissions issues with the cloud testing component, the |
…config still works
---
Alright, I added a second integration test that checks for the condition that gave us so much trouble. The test sets a config, installs the agent, makes sure monitoring is still enabled, then uses the overrides to change the port.
---
@fearful-symmetry is it ready for review then? |
---
@pierrehilbert yup. |
blakerouse
left a comment
Looks good. Just a comment on the name of a function.
Overall this is easy to follow and very well tested.
```go
// from the coordinator loop. This can be used as a basic health check,
// as we'll time out and return false if the coordinator run loop doesn't
// respond to our channel.
func (c *Coordinator) CoordinatorActive(timeout time.Duration) bool {
```
Why is this prefixed with Coordinator? It is the coordinator, why not just Active()?
yeah, good point
---




What does this PR do?
Closes #4582
This reverts the revert in #4583 and adds some extra logic so that if the HTTP monitoring server is enabled, and we get a config reload that does not explicitly set it to disabled, we keep it enabled.
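The "keep it enabled unless explicitly disabled" rule hinges on distinguishing an absent setting from an explicit `false`. A common Go idiom for that is a `*bool`; this sketch uses hypothetical names (`ReloadedHTTP`, `applyReload`) and is not the agent's actual implementation:

```go
package main

import "fmt"

// ReloadedHTTP models an incoming reload where "enabled" may be absent.
// A *bool distinguishes "not set" (nil) from an explicit true/false,
// which is what lets a reload that never mentions the HTTP server leave
// it alone.
type ReloadedHTTP struct {
	Enabled *bool
}

// applyReload returns the new enabled state: explicit values win, and an
// absent value keeps whatever was running before.
func applyReload(currentlyEnabled bool, incoming ReloadedHTTP) bool {
	if incoming.Enabled == nil {
		return currentlyEnabled
	}
	return *incoming.Enabled
}

func main() {
	off := false
	fmt.Println(applyReload(true, ReloadedHTTP{}))              // unset keeps it on
	fmt.Println(applyReload(true, ReloadedHTTP{Enabled: &off})) // explicit disable wins
}
```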
This also adds a ton of tests, because now I'm paranoid.
Why is it important?
Checklist
./changelog/fragments using the changelog tool