
Fix liveness reload config handling#4586

Merged
fearful-symmetry merged 11 commits into elastic:main from fearful-symmetry:fix-liveness-reload-config-handling
Apr 24, 2024

Conversation

@fearful-symmetry
Contributor

@fearful-symmetry fearful-symmetry commented Apr 16, 2024

What does this PR do?

Closes #4582

This reverts the revert in #4583 and adds some extra logic so that if the HTTP monitoring server is enabled and we get a config reload that does not explicitly set it to disabled, we keep it enabled.

This also adds a ton of tests, because now I'm paranoid.
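The "keep it enabled" rule described above can be sketched roughly as follows. The `monitoringCfg` type and its `Enabled` pointer are simplified stand-ins for illustration, not the agent's real config structs; a nil pointer models a reloaded config that never mentions the setting.

```go
package main

import "fmt"

// monitoringCfg is an illustrative stand-in for the HTTP monitoring config.
// A nil Enabled pointer means the reloaded config did not mention the
// setting at all.
type monitoringCfg struct {
	Enabled *bool
}

// reloadEnabled returns the server's new enabled state: an explicit value
// in the incoming config wins, but an absent value preserves the current
// state instead of silently disabling the server.
func reloadEnabled(current bool, incoming monitoringCfg) bool {
	if incoming.Enabled != nil {
		return *incoming.Enabled
	}
	return current
}

func main() {
	off := false
	fmt.Println(reloadEnabled(true, monitoringCfg{}))              // true: setting absent, stay enabled
	fmt.Println(reloadEnabled(true, monitoringCfg{Enabled: &off})) // false: explicit disable wins
}
```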

Why is it important?

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

@fearful-symmetry fearful-symmetry added the Team:Elastic-Agent Label for the Agent team label Apr 16, 2024
@fearful-symmetry fearful-symmetry self-assigned this Apr 16, 2024
@fearful-symmetry fearful-symmetry requested a review from a team as a code owner April 16, 2024 19:52
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@fearful-symmetry fearful-symmetry requested a review from pchila April 16, 2024 19:52
@mergify
Contributor

mergify bot commented Apr 16, 2024

This pull request does not have a backport label. Could you fix it @fearful-symmetry? 🙏
To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v\d.\d.\d is the label to automatically backport to the 8.\d branch. \d is the digit

NOTE: backport-skip has been added to this pull request.

@cmacknz
Member

cmacknz commented Apr 16, 2024

The instructions in https://github.com/elastic/elastic-agent?tab=readme-ov-file#testing-on-elastic-cloud should let you create a cloud deployment that uses this version of the agent.

You can use that to confirm integrations server comes up as healthy.

@cmacknz
Member

cmacknz commented Apr 16, 2024

I tested this locally and it doesn't seem to work. I installed a standalone agent with the HTTP server enabled

In elastic-agent.yml:

agent.monitoring:
  enabled: true
  http:
      enabled: true
      port: 6791
❯ sudo ./elastic-agent install -f
❯ curl localhost:6791/processes
{"processes":[{"id":"system/metrics-default","binary":"metricbeat","source":{"kind":"configured","outputs":["default"]}},{"id":"filestream-monitoring","binary":"filebeat","source":{"kind":"internal","outputs":["monitoring"]}},{"id":"beat/metrics-monitoring","binary":"metricbeat","source":{"kind":"internal","outputs":["monitoring"]}},{"id":"http/metrics-monitoring","binary":"metricbeat","source":{"kind":"internal","outputs":["monitoring"]}}]}

Then I enrolled it in fleet with the default policy:

sudo elastic-agent enroll --url=https://1cdab201df9f4dbb936887167f9d4aa2.fleet.eastus2.staging.azure.foundit.no:443 --enrollment-token=XXXX

And I can't hit the processes endpoint anymore:

❯ curl localhost:6791/liveness
404 page not found

The pre-config.yaml has:

agent:
    download:
        sourceURI: https://artifacts.elastic.co/downloads/
    features: null
    monitoring:
        enabled: true
        logs: true
        metrics: true
        namespace: default
        use_output: default

The local-config.yaml has:

    monitoring:
        diagnostics:
            limit:
                burst: 1
                interval: 1m0s
            uploader:
                initdur: 1s
                maxdur: 10m0s
                maxretries: 10
        enabled: true
        http:
            buffer: null
            enabled: true
            host: localhost
            port: 6791

@fearful-symmetry
Contributor Author

I'm still trying to figure out what's going on here. @cmacknz it looks like you're using staging.found.no for the fleet server, but the 8.14 snapshot build seems broken, probably as fallout from this PR. Did you just get lucky with the cloud or am I being dumb?

@fearful-symmetry
Contributor Author

fearful-symmetry commented Apr 17, 2024

Interestingly, I can't reproduce this with the local file config, despite the fact that it should be the same reloading mechanism passing the same config. I wonder if there's something weird going on with the config merging.

@fearful-symmetry
Contributor Author

Alright, it's getting late. Hopefully tomorrow the snapshots will be updated and I'll have a few more options for reproducing this.

@cmacknz
Member

cmacknz commented Apr 17, 2024

I'm still trying to figure out what's going on here. @cmacknz it looks like you're using staging.found.no for the fleet server, but the 8.14 snapshot build seems broken, probably as fallout from this PR. Did you just get lucky with the cloud or am I being dumb?

I created my cloud deployment before the problem was introduced.

Using an 8.13.3-SNAPSHOT deployment or a stack from elastic-package are ways around this if the 8.14.0-SNAPSHOT isn't fixed tomorrow.

@fearful-symmetry
Contributor Author

Alright, can FINALLY reproduce this.

@fearful-symmetry
Contributor Author

fearful-symmetry commented Apr 17, 2024

Alright, the behavior of agent with @cmacknz's example (elastic-agent install -f & elastic-agent enroll ...) is different from what I expected, and different from how it behaves when you reload agent with something like a static config file change:

  1. You run install -f, agent starts with the config file config
  2. We call the HTTP monitor Reload, the monitor sees that the http endpoint is enabled, we start it
  3. You run enroll
  4. Agent restarts
  5. We call the HTTP monitor Reload, the config is set to disabled, and since agent has just started, we don't know that we were previously enabled
  6. The HTTP endpoint remains disabled; no further Reload events happen.

That's... a slightly more complex problem. It's not "we need to care about state between reloads", it's "the agent must know what happened before it started".

@cmacknz
Member

cmacknz commented Apr 17, 2024

Now that I remember more: on enroll we actually replace the current agent configuration with a blank one before getting it from Fleet.

This happens in a non-obvious place here:

store := storage.NewReplaceOnSuccessStore(
	pathConfigFile,
	application.DefaultAgentFleetConfig,
	encStore,
	storeOpts...,
)

The starting configuration is here:

# ================================ General =====================================
# Beats is configured under Fleet, you can define most settings
# from the Kibana UI. You can update this file to configure the settings that
# are not supported by Fleet.
fleet:
  enabled: true

The agent is restarted when it is enrolled as well.

@cmacknz
Member

cmacknz commented Apr 17, 2024

The thing is, this reset of the configuration happened even without your reload changes, so I'm now confused how monitoring remained enabled in cloud to begin with.

@fearful-symmetry
Contributor Author

fearful-symmetry commented Apr 17, 2024

So, in the code currently in main, the HTTP config gets set remarkably early, in loadConfig():

func loadConfig(ctx context.Context, override cfgOverrider, runAsOtel bool) (*configuration.Configuration, error) {

This eventually loads a bunch of the state from the encrypted state store. This all happens before fleet init, I think. I assume what's happening is that when we install, the config in the .enc files still has our old hard-coded config, and those values are acting as overrides that get passed to the HTTP monitoring config. In this PR, we don't care about the state store; we just get our config from the reloader, which gets it from fleet.
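The layering being described (file config loaded first, the encrypted state store's values acting as overrides, then the fleet-driven reload on top) behaves like a last-writer-wins merge. The sketch below illustrates only that precedence; the flat map shape and key names are assumptions for illustration, not the agent's real config machinery.

```go
package main

import "fmt"

// effectiveConfig merges config layers in order, with later layers
// overriding earlier ones, key by key.
func effectiveConfig(layers ...map[string]any) map[string]any {
	out := map[string]any{}
	for _, layer := range layers {
		for k, v := range layer {
			out[k] = v
		}
	}
	return out
}

func main() {
	fileCfg := map[string]any{"http.enabled": true, "http.port": 6791}
	encStore := map[string]any{"http.enabled": true}
	fleetCfg := map[string]any{"http.enabled": false}

	cfg := effectiveConfig(fileCfg, encStore, fleetCfg)
	fmt.Println(cfg["http.enabled"], cfg["http.port"]) // false 6791
}
```

Note how a later layer that only sets one key still leaves earlier keys (the port) intact, which is exactly why a partial fleet config can flip `enabled` without touching anything else.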

@fearful-symmetry
Contributor Author

I'm gonna see if I can get the HTTP handler to consider the "early" overrides config.

@pchila
Member

pchila commented Apr 17, 2024

@fearful-symmetry
Contributor Author

fearful-symmetry commented Apr 17, 2024

Alright @cmacknz this should work.

However, this raises a few more implementation questions, since it means we need to actually care about not just the fleet config, but also the config at init time. Do we try to merge them somehow? Just care about the enabled value?

Going to keep poking at tests.

@cmacknz
Member

cmacknz commented Apr 17, 2024

However, this raises a few more implementation questions, since it means we need to actually care about not just the fleet config, but also the config at init time. Do we try to merge them somehow? Just care about the enabled value?

The version of this change that is not breaking for anybody is to work exactly the way it did before, except now it respects changed configurations from fleet and standalone configuration reloads.

That is we want the init time behavior to be unchanged, only what happens after that needs to change.

@fearful-symmetry
Contributor Author

Alright, so, this should be a relatively reasonable default. If there's no fleet config set, just revert to the original overrides settings.
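The fallback default described here can be sketched as follows; the `httpConfig` type and function names are simplified stand-ins for illustration, not the agent's real structs.

```go
package main

import "fmt"

// httpConfig is an illustrative stand-in for the HTTP monitoring settings.
type httpConfig struct {
	Enabled bool
	Port    int
}

// resolveConfig implements the default described above: if a reload
// carries no fleet-side config at all, revert to the settings captured at
// init time (the original overrides) instead of defaulting to disabled.
func resolveConfig(fleet *httpConfig, initCfg httpConfig) httpConfig {
	if fleet == nil {
		return initCfg // no fleet config: keep the init-time overrides
	}
	return *fleet // fleet config present: it wins
}

func main() {
	initCfg := httpConfig{Enabled: true, Port: 6791}
	fmt.Println(resolveConfig(nil, initCfg))                         // {true 6791}
	fmt.Println(resolveConfig(&httpConfig{Enabled: false}, initCfg)) // {false 0}
}
```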

@fearful-symmetry
Contributor Author

Alright, well, I've tested this, but I'm having some permissions issues with the cloud testing component; the cloud:push command just results in denied: requested access to the resource is denied, despite setting the API key. Either I'm doing something dumb or there's an extra layer of permissions needed somewhere in my cloud account.

@fearful-symmetry
Contributor Author

Alright, I added a second integration test that checks for the condition that gave us so much trouble. The test sets a static config, installs the agent, makes sure monitoring is still enabled, then uses the overrides to change the port.

@pierrehilbert
Contributor

@fearful-symmetry is it ready for review then?

@fearful-symmetry
Contributor Author

@pierrehilbert yup.

Contributor

@blakerouse blakerouse left a comment


Looks good. Just a comment on the name of a function.

Overall this is easy to follow and very well tested.

// from the coordinator loop. This can be used as a basic health check,
// as we'll time out and return false if the coordinator run loop doesn't
// respond to our channel.
func (c *Coordinator) CoordinatorActive(timeout time.Duration) bool {
Contributor


Why is this prefixed with Coordinator? It is the coordinator, why not just Active()?

Contributor Author


yeah, good point



Labels

backport-skip Team:Elastic-Agent Label for the Agent team


Development

Successfully merging this pull request may close these issues.

Disable automatic reload of the agent monitoring HTTP server until handling of default values is figured out

6 participants