Rename to management.Manager, add UpdateStatus to Manager interface. #19114
blakerouse merged 4 commits into elastic:master
Conversation
Pinging @elastic/ingest-management (Team:Ingest Management)
Pinging @elastic/integrations-services (Team:Services)
💚 Build Succeeded
func (*nilManager) Stop() {}
func (*nilManager) CheckRawConfig(cfg *common.Config) error { return nil }
func (n *nilManager) UpdateStatus(status Status, msg string) {
	n.lock.Lock()
Do we need this here if this is the nil manager, i.e. a noop manager?
I do it only for logging the status changes. I think it is good to show status updates in the logger as well, so it's clear what is happening in the beat, even if it's not controlled by Elastic Agent.
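A minimal sketch of that idea, assuming simplified stand-in names for the types involved (the real ones live in libbeat/management): the noop manager tracks the last status under a lock only so it can log transitions, and stays silent on repeated identical updates.

```go
package main

import (
	"fmt"
	"sync"
)

// Status is a simplified stand-in for the beat status type in this PR.
type Status int

const (
	Running Status = iota
	Configuring
	Failed
)

// nilManager sketch: a noop manager that only logs status changes, so the
// transitions stay visible even when no Elastic Agent controls the beat.
type nilManager struct {
	lock   sync.Mutex
	status Status
	msg    string
}

func (n *nilManager) UpdateStatus(status Status, msg string) {
	n.lock.Lock()
	defer n.lock.Unlock()
	// Only log when something actually changed.
	if n.status != status || n.msg != msg {
		n.status = status
		n.msg = msg
		fmt.Printf("Status change to %d: %s\n", status, msg)
	}
}

func main() {
	m := &nilManager{}
	m.UpdateStatus(Configuring, "Updating configuration")
	m.UpdateStatus(Configuring, "Updating configuration") // duplicate, not logged
}
```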
func (cm *Manager) OnConfig(s string) {
	cm.client.Status(proto.StateObserved_CONFIGURING, "Updating configuration")
	cm.UpdateStatus(management.Configuring, "Updating configuration")
Should we switch back after the config is applied, or do we let the beat do that?
It does switch back to Running at the end of OnConfig.
You're right, it was hidden so I overlooked it.
If I read this right OnConfig sets the status to 'Healthy' at the end. Are we sure this is the correct state to report? Do we 'forget' the internal state here?
How about removing the 'Configuring' status? The fewer states (and protocol behaviors) we have, the less we can get wrong.
I personally do not like the idea of setting the status in the OnConfig handler, I feel like that should really only be done specifically by the beat itself. So if OnConfig passes it to the reloader then the reloader should really set the Status.
At the moment I keep it this way so it works, and once status is being reported correctly by the beat, this should be removed (probably at the same time).
I do not think we should remove Configuring; I think it is a key state for watching the observability of the Beat and propagating that information to Fleet.
It is good to know that going Healthy -> Configuring -> Failed means everything was good, then updating the configuration caused it to fail.
> I personally do not like the idea of setting the status in the OnConfig handler, I feel like that should really only be done specifically by the beat itself. So if OnConfig passes it to the reloader then the reloader should really set the Status.
> At the moment I keep it this way so it works, and once status is being reported correctly by the beat, this should be removed (probably at the same time).
+1
Is this required change/cleanup documented in an issue?
	return res, nil
}

func statusToProtoStatus(status management.Status) proto.StateObserved_Status {
Can we add a unit test to test switching statuses back and forth with some noop client?
At the moment I don't believe the client from elastic-agent-client is exposed as an interface, which would allow me to completely replace it with a noop client.
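Even without a full noop client, the pure mapping function can be tested table-driven. A sketch under assumed names (the stand-in Status and ProtoStatus enums and the default-to-healthy behavior are illustrative, not the library's real definitions):

```go
package main

import "fmt"

// Simplified stand-ins for management.Status and proto.StateObserved_Status.
type Status int

const (
	Unknown Status = iota
	Running
	Degraded
	Failed
)

type ProtoStatus int

const (
	ProtoHealthy ProtoStatus = iota
	ProtoDegraded
	ProtoFailed
)

// statusToProtoStatus sketch: map beat statuses to protocol statuses,
// defaulting anything unrecognized to healthy (assumed behavior).
func statusToProtoStatus(status Status) ProtoStatus {
	switch status {
	case Degraded:
		return ProtoDegraded
	case Failed:
		return ProtoFailed
	default:
		return ProtoHealthy
	}
}

func main() {
	// Table-driven check of the mapping, in the spirit of the unit test
	// this review asked for.
	cases := map[Status]ProtoStatus{
		Unknown:  ProtoHealthy,
		Running:  ProtoHealthy,
		Degraded: ProtoDegraded,
		Failed:   ProtoFailed,
	}
	for in, want := range cases {
		if got := statusToProtoStatus(in); got != want {
			panic(fmt.Sprintf("status %d: got %d, want %d", in, got, want))
		}
	}
	fmt.Println("all mappings ok")
}
```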
if cm.status != status || cm.msg != msg {
	cm.status = status
	cm.msg = msg
	cm.lock.Unlock()
We should test this with concurrent requests. What I fear may occur is that with many concurrent changes of status
we will end up with a different internal state vs the reported state.
Do you see some pros/cons of doing the reporting using a channel, or of having a full-flow lock?
We can also try to 'forget' outdated status updates if we get new ones while we are waiting. This could be implemented using sync.Cond.
My only reason to track the state at this level is so it can be logged when it is changed. We could remove the internal tracking and locking in the Fleet Manager if we just log every time UpdateStatus is called. Then the locking inside of client will handle updating the correct state to the manager.
Or we could place the call to the client inside of the lock to ensure that it's always the same between the two.
I think the client option is also ok. I think logging inconsistent states will cause us more trouble than help us during SDH triaging.
I went with moving the logging and client into the lock.
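A sketch of that resolution, with hypothetical names (statusClient is an assumed interface standing in for the concrete elastic-agent-client, which the thread notes is not actually an interface): the lock is held across the state update, the log line, and the client call, so the reported state can never lag the internal state.

```go
package main

import (
	"fmt"
	"sync"
)

type Status int

// statusClient is a hypothetical stand-in for the elastic-agent-client.
type statusClient interface {
	Status(status Status, msg string)
}

type Manager struct {
	lock   sync.Mutex
	status Status
	msg    string
	client statusClient
}

// UpdateStatus holds the lock across the state change, the log line, and the
// client call, so concurrent updates cannot interleave between them.
func (cm *Manager) UpdateStatus(status Status, msg string) {
	cm.lock.Lock()
	defer cm.lock.Unlock()
	if cm.status == status && cm.msg == msg {
		return // no change, nothing to log or report
	}
	cm.status = status
	cm.msg = msg
	fmt.Printf("Status change to %d: %s\n", status, msg)
	cm.client.Status(status, msg)
}

// recordingClient captures the last reported status, for demonstration.
type recordingClient struct{ last Status }

func (r *recordingClient) Status(s Status, msg string) { r.last = s }

func main() {
	c := &recordingClient{}
	m := &Manager{client: c}
	m.UpdateStatus(2, "degraded")
	fmt.Println(c.last == m.status) // reported state matches internal state
}
```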
libbeat/management/management.go (Outdated)

	CheckRawConfig(cfg *common.Config) error

	// UpdateStatus called when the status of the beat has changed.
	UpdateStatus(status Status, msg string)
How about moving the UpdateStatus method into its own interface? I'd like to push only 'status' reporting to the inputs/outputs and provide a set of helper functions that wrap a 'StatusReporter', merging status changes from multiple subsystems. But in the end we can also introduce the minimal interface in another package when we need it.
I can move it to its own interface; would it also still be in the management.Manager interface?
type Manager interface {
StatusReporter
...
}
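The embedding shown above can be sketched end to end; the Status type and the fake implementation are illustrative, and only the StatusReporter/Manager shape follows the snippet in the thread:

```go
package main

import "fmt"

type Status int

// StatusReporter is the minimal interface that could be pushed down to
// inputs/outputs, per the suggestion above.
type StatusReporter interface {
	UpdateStatus(status Status, msg string)
}

// Manager embeds StatusReporter, so every Manager is also a StatusReporter.
type Manager interface {
	StatusReporter
	// ... the rest of the management methods would go here
}

// fake shows that anything implementing UpdateStatus satisfies StatusReporter.
type fake struct{ last Status }

func (f *fake) UpdateStatus(status Status, msg string) { f.last = status }

func main() {
	var r StatusReporter = &fake{}
	r.UpdateStatus(1, "degraded")
	fmt.Println("StatusReporter satisfied")
}
```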
	cm.logger.Infof("Status change to %s: %s", status, msg)
	return
}
cm.lock.Unlock()
Why unlock before the call to client.Status? If we have a race between the unlock and the Status being sent by two goroutines, how can you tell which one is the newer status update?
Yeah, I was commenting on that as well. I can move the client.Status call inside of the lock; I think that would be the best.
…or statusToProtoStatus.
@michalpristas @urso This is ready for another review.
cm.status = status
cm.msg = msg
cm.lock.Unlock()
cm.client.Status(statusToProtoStatus(status), msg)
Here we update the status on the client, which will result in some action either during check-in or in the watcher after some timeout.
If somebody reports a failure, I think (correct me if my memory fails) we talked about restarting the application immediately; can you make sure this flow is valid?
This flow is valid from the side of the Elastic Agent, as reporting Failed will cause the application to be killed and restarted. This is fully covered in the Elastic Agent unit tests, both in the case that the application returns Failed over the protocol and in the case that the application just crashes.
err = errors.Wrap(err, "could not apply the configuration")
cm.logger.Error(err)
cm.client.Status(proto.StateObserved_FAILED, err.Error())
cm.UpdateStatus(management.Failed, err.Error())
Given all these errors I wonder what 'Failed' means. Is the beat really 'failed' or are we (the overall application with agent) running in a degraded mode because not all configurations could have been applied? Even if we fail here the Beat continues to publish events for existing inputs, right?
Are we mixing the status of the 'operation' with the beat health here? If the operation fails, are we required to restart the Beat?
As stated in the list of statuses, there are Failed and Degraded:
// Degraded is status describing application is degraded.
Degraded
// Failed is status describing application is failed. This status should
// only be used in the case the beat should stop running as the failure
// cannot be recovered.
Failed
In the Degraded state the beat can keep sending events and the agent will not stop the process. Failed is failed: I have completely failed, restart me.
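The distinction can be captured as a compact enum sketch. This is a simplified stand-in (the real list in libbeat/management includes more states, such as Unknown, Starting, and Configuring); the docstrings follow the ones quoted above:

```go
package main

import "fmt"

// Status sketch mirroring the semantics discussed above (simplified).
type Status int

const (
	// Running is status describing application is running.
	Running Status = iota
	// Degraded is status describing application is degraded: the beat keeps
	// sending events and the agent will not stop the process.
	Degraded
	// Failed is status describing application is failed. This status should
	// only be used in the case the beat should stop running as the failure
	// cannot be recovered; the agent will restart the process.
	Failed
)

// String makes status transitions readable in logs.
func (s Status) String() string {
	return [...]string{"Running", "Degraded", "Failed"}[s]
}

func main() {
	fmt.Println(Degraded, Failed) // prints "Degraded Failed"
}
```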
Yeah, I've seen these status types. Is this method run on startup only, or also if the beat is already active? In the latter case the Beat will continue with the old state if there was an error decoding the configuration. Do we want to enforce a 'restart' if the config update operation failed? Will Failed trigger a restart by the Agent?
Are we mixing the status of the update 'operation' with the beat health here? If the operation fails, are we required to restart the Beat?
I'm more or less wondering about the overall semantics of OnConfig and the single errors we can get.
It will run both on startup and while running.
The beat doesn't need to worry about handling the restart on Failed status, the Agent will do that work for it, restart it and bring it back up.
michalpristas left a comment
I think it's good for a first iteration, and the exact form will crystallize once we have at least some healthcheck implemented.
…lastic#19114) * Rename management.ConfigManager to management.Manager, add UpdateStatus to Manager interface. * Update docstring for Failed status. * Add to developer changelog. * Add StatusReporter interface, wrap client.Status in lock. Add tests for statusToProtoStatus. (cherry picked from commit 05c9065)
…19114) (#19172) * Rename management.ConfigManager to management.Manager, add UpdateStatus to Manager interface. * Update docstring for Failed status. * Add to developer changelog. * Add StatusReporter interface, wrap client.Status in lock. Add tests for statusToProtoStatus. (cherry picked from commit 05c9065)
What does this PR do?
Renames management.ConfigManager to management.Manager, being that it now does more than just manage configurations. Adds UpdateStatus to management.Manager to allow a beat to update the status.
Using FleetManager, the status will be propagated over the elastic-agent-client to the controlling agent.
Why is it important?
So status can be reported both to the controlling Agent and up to Fleet for observability of the beats that Agent is running.
Reporting a Failed status to Agent will result in the Agent stopping and restarting the beat.
Checklist
- I have made corresponding changes to the documentation
- I have made corresponding change to the default configuration files
- CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc
How to test this PR locally
Related issues
HealthCheck function in every beats. #17737
Screenshots
Logs