Fix liveness reload config handling #4586
Conversation
---
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
---
This pull request does not have a backport label. Could you fix it @fearful-symmetry? 🙏
NOTE: The instructions in https://github.com/elastic/elastic-agent?tab=readme-ov-file#testing-on-elastic-cloud should let you create a cloud deployment that uses this version of the agent. You can use that to confirm the integrations server comes up as healthy.
---
I tested this locally and it doesn't seem to work. I installed a standalone agent with the HTTP server enabled in elastic-agent.yml:

```yaml
agent.monitoring:
  enabled: true
  http:
    enabled: true
    port: 6791
```

```shell
❯ sudo ./elastic-agent install -f
❯ curl localhost:6791/processes
{"processes":[{"id":"system/metrics-default","binary":"metricbeat","source":{"kind":"configured","outputs":["default"]}},{"id":"filestream-monitoring","binary":"filebeat","source":{"kind":"internal","outputs":["monitoring"]}},{"id":"beat/metrics-monitoring","binary":"metricbeat","source":{"kind":"internal","outputs":["monitoring"]}},{"id":"http/metrics-monitoring","binary":"metricbeat","source":{"kind":"internal","outputs":["monitoring"]}}]}
```

Then I enrolled it in Fleet with the default policy:

```shell
sudo elastic-agent enroll --url=https://1cdab201df9f4dbb936887167f9d4aa2.fleet.eastus2.staging.azure.foundit.no:443 --enrollment-token=XXXX
```

And I can't hit the processes endpoint anymore:

```shell
❯ curl localhost:6791/liveness
404 page not found
```

The pre-config.yaml has:

```yaml
agent:
  download:
    sourceURI: https://artifacts.elastic.co/downloads/
  features: null
  monitoring:
    enabled: true
    logs: true
    metrics: true
    namespace: default
    use_output: default
```

The local-config.yaml has:

```yaml
monitoring:
  diagnostics:
    limit:
      burst: 1
      interval: 1m0s
    uploader:
      initdur: 1s
      maxdur: 10m0s
      maxretries: 10
  enabled: true
  http:
    buffer: null
    enabled: true
    host: localhost
    port: 6791
```
---
I'm still trying to figure out what's going on here. @cmacknz it looks like you're using staging.found.no for the fleet server, but the 8.14 snapshot build seems broken, probably as fallout from this PR. Did you just get lucky with the cloud or am I being dumb?
---
Interestingly, I can't reproduce this with the local file config, despite the fact that it should be the same reloading mechanism passing the same config. I wonder if there's something weird going on with the config merging.
---
Alright, it's getting late. Hopefully tomorrow the snapshots will be updated and I'll have a few more options for reproducing this.
I created my cloud deployment before the problem was introduced. Using an 8.13.3-SNAPSHOT deployment or a stack from elastic-package are ways around this if the 8.14.0-SNAPSHOT isn't fixed tomorrow. |
---
Alright, I can FINALLY reproduce this.
---
Alright, the behavior of the agent with @cmacknz's example (
That's... a slightly more complex problem. It's not "we need to care about state between reloads", it's "the agent must know what happened before it started".
---
Now that I remember more: on enroll we actually replace the current agent configuration with a blank one before getting it from Fleet. This happens in a non-obvious place here: elastic-agent/internal/pkg/agent/cmd/enroll.go lines 497 to 502 in eca5bc7. The starting configuration is here: elastic-agent/_meta/elastic-agent.fleet.yml lines 1 to 7 in eca5bc7. The agent is restarted when it is enrolled as well.
---
The thing is, this reset of the configuration happened even without your reload changes, so I'm now confused how monitoring remained enabled in cloud to begin with.
---
So, in the code currently in main, the HTTP config gets set remarkably early, in elastic-agent/internal/pkg/agent/cmd/run.go line 387 in eca5bc7. This eventually loads a bunch of the state from the encrypted state store. This all happens before Fleet init, I think. I assume what's happening is that when we install, the config in the
---
I'm gonna see if I can get the HTTP handler to consider the "early" overrides config.
---
@fearful-symmetry Maybe you can test this PR with https://github.com/elastic/elastic-agent/blob/main/testing/environments/cloud/README.md ? |
---
Alright @cmacknz, this should work. However, this raises a few more implementation questions, since it means we need to actually care about not just the Fleet config, but also the config at init time. Do we try to merge them somehow? Just care about the
Going to keep poking at tests.
The version of this change that is not breaking for anybody is to work exactly the way it did before, except now it respects changed configurations from Fleet and standalone configuration reloads. That is, we want the init-time behavior to be unchanged; only what happens after that needs to change.
---
Alright, so, this should be a relatively reasonable default. If there's no fleet config set, just revert to the original overrides settings. |
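That default can be sketched as a small selection function. This is a minimal illustration, not the agent's actual code — `HTTPConfig` and `effectiveHTTP` are hypothetical names, assuming the reload hands us either a monitoring config or nothing at all:

```go
package main

import "fmt"

// HTTPConfig is a simplified stand-in for the agent's monitoring HTTP
// settings; the real struct lives in the elastic-agent config packages.
type HTTPConfig struct {
	Enabled bool
	Host    string
	Port    int
}

// effectiveHTTP picks the config a reload should apply: if the incoming
// (e.g. Fleet) config carries no monitoring section at all, fall back to
// the overrides captured at init time instead of silently disabling the
// server. An explicit incoming config always wins.
func effectiveHTTP(initOverrides, reloaded *HTTPConfig) *HTTPConfig {
	if reloaded == nil {
		return initOverrides
	}
	return reloaded
}

func main() {
	overrides := &HTTPConfig{Enabled: true, Host: "localhost", Port: 6791}
	// A Fleet policy that never mentions monitoring.http arrives as nil here,
	// so the server from the original overrides stays up.
	fmt.Println(effectiveHTTP(overrides, nil).Enabled)
	// An explicit disable still takes effect.
	fmt.Println(effectiveHTTP(overrides, &HTTPConfig{Enabled: false}).Enabled)
}
```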
---
Alright, well, I've tested this, but I'm having some permissions issues with the cloud testing component, the |
…config still works
---
Alright, I added a second integration test that checks for the condition that gave us so much trouble. The test sets a config, installs the agent, makes sure monitoring is still enabled, then uses the overrides to change the port.
---
@fearful-symmetry is it ready for review then? |
---
@pierrehilbert yup. |
blakerouse
left a comment
Looks good. Just a comment on the name of a function.
Overall this is easy to follow and very well tested.
```go
// from the coordinator loop. This can be used as a basic health check,
// as we'll time out and return false if the coordinator run loop doesn't
// respond to our channel.
func (c *Coordinator) CoordinatorActive(timeout time.Duration) bool {
```
Why is this prefixed with Coordinator? It is the coordinator, why not just Active()?
yeah, good point
---




What does this PR do?
Closes #4582
This reverts the revert in #4583 and adds some extra logic so that if the HTTP monitoring server is enabled, and we get a config reload that does not explicitly set it to disabled, we keep it enabled.
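The "keep it enabled unless explicitly disabled" rule hinges on distinguishing an absent setting from an explicit `false`. A common Go idiom for that is a `*bool`; this sketch uses hypothetical names (`ReloadedHTTP`, `applyReload`) and is not the agent's actual implementation:

```go
package main

import "fmt"

// ReloadedHTTP models an incoming reload where "enabled" may be absent.
// A *bool distinguishes "not set" (nil) from an explicit true/false,
// which is what lets a reload that never mentions the HTTP server leave
// it alone.
type ReloadedHTTP struct {
	Enabled *bool
}

// applyReload returns the new enabled state: explicit values win, and an
// absent value keeps whatever was running before.
func applyReload(currentlyEnabled bool, incoming ReloadedHTTP) bool {
	if incoming.Enabled == nil {
		return currentlyEnabled
	}
	return *incoming.Enabled
}

func main() {
	off := false
	fmt.Println(applyReload(true, ReloadedHTTP{}))              // unset keeps it on
	fmt.Println(applyReload(true, ReloadedHTTP{Enabled: &off})) // explicit disable wins
}
```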
This also adds a ton of tests, because now I'm paranoid.
Why is it important?
Checklist
./changelog/fragments using the changelog tool