Rename to management.Manager, add UpdateStatus to Manager interface. #19114
blakerouse merged 4 commits into elastic:master
Conversation
Pinging @elastic/ingest-management (Team:Ingest Management)
Pinging @elastic/integrations-services (Team:Services)
💚 Build Succeeded
func (*nilManager) Stop() {}
func (*nilManager) CheckRawConfig(cfg *common.Config) error { return nil }
func (n *nilManager) UpdateStatus(status Status, msg string) {
	n.lock.Lock()
Do we need this here if this is the nil manager, i.e. a noop manager?
I do it only for logging the status changes. I think it is good to show status updates in the logger as well, so it's clear what is happening in the beat, even if it's not controlled by Elastic Agent.
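A minimal sketch of that idea, assuming simplified stand-in names for the types involved (the real ones live in libbeat/management): the noop manager tracks the last status under a lock only so it can log transitions, and stays silent on repeated identical updates.

```go
package main

import (
	"fmt"
	"sync"
)

// Status is a simplified stand-in for the beat status type in this PR.
type Status int

const (
	Running Status = iota
	Configuring
	Failed
)

// nilManager sketch: a noop manager that only logs status changes, so the
// transitions stay visible even when no Elastic Agent controls the beat.
type nilManager struct {
	lock   sync.Mutex
	status Status
	msg    string
}

func (n *nilManager) UpdateStatus(status Status, msg string) {
	n.lock.Lock()
	defer n.lock.Unlock()
	// Only log when something actually changed.
	if n.status != status || n.msg != msg {
		n.status = status
		n.msg = msg
		fmt.Printf("Status change to %d: %s\n", status, msg)
	}
}

func main() {
	m := &nilManager{}
	m.UpdateStatus(Configuring, "Updating configuration")
	m.UpdateStatus(Configuring, "Updating configuration") // duplicate, not logged
}
```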
func (cm *Manager) OnConfig(s string) {
	cm.client.Status(proto.StateObserved_CONFIGURING, "Updating configuration")
	cm.UpdateStatus(management.Configuring, "Updating configuration")
Should we switch back after the config is applied, or do we let the beat do that?
It does switch back to Running at the end of OnConfig.
You're right, it was hidden so I overlooked it.
If I read this right OnConfig sets the status to 'Healthy' at the end. Are we sure this is the correct state to report? Do we 'forget' the internal state here?
How about removing the 'Configuring' status? The fewer states (and protocol behaviors) we have, the less we can get wrong.
I personally do not like the idea of setting the status in the OnConfig handler, I feel like that should really only be done specifically by the beat itself. So if OnConfig passes it to the reloader then the reloader should really set the Status.
At the moment I keep it this way so it works, and once status is being reported correctly by the beat, this should be removed (probably at the same time).
I do not think we should remove Configuring; I think it is a key state for watching the observability of the Beat and propagating that information to Fleet.
It is good to know that going Healthy -> Configuring -> Failed means everything was good, then updating the configuration caused it to fail.
> I personally do not like the idea of setting the status in the OnConfig handler, I feel like that should really only be done specifically by the beat itself. So if OnConfig passes it to the reloader then the reloader should really set the Status.
> At the moment I keep it this way so it works, and once status is being reported correctly by the beat, this should be removed (probably at the same time).
+1
Is this required change/cleanup documented in an issue?
	return res, nil
}

func statusToProtoStatus(status management.Status) proto.StateObserved_Status {
Can we add a unit test to test switching statuses back and forth with some noop client?
At the moment I don't believe the client from elastic-agent-client is exposed as an interface, which would allow me to completely replace it with a noop client.
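Even without a full noop client, the pure mapping function can be tested table-driven. A sketch under assumed names (the stand-in Status and ProtoStatus enums and the default-to-healthy behavior are illustrative, not the library's real definitions):

```go
package main

import "fmt"

// Simplified stand-ins for management.Status and proto.StateObserved_Status.
type Status int

const (
	Unknown Status = iota
	Running
	Degraded
	Failed
)

type ProtoStatus int

const (
	ProtoHealthy ProtoStatus = iota
	ProtoDegraded
	ProtoFailed
)

// statusToProtoStatus sketch: map beat statuses to protocol statuses,
// defaulting anything unrecognized to healthy (assumed behavior).
func statusToProtoStatus(status Status) ProtoStatus {
	switch status {
	case Degraded:
		return ProtoDegraded
	case Failed:
		return ProtoFailed
	default:
		return ProtoHealthy
	}
}

func main() {
	// Table-driven check of the mapping, in the spirit of the unit test
	// this review asked for.
	cases := map[Status]ProtoStatus{
		Unknown:  ProtoHealthy,
		Running:  ProtoHealthy,
		Degraded: ProtoDegraded,
		Failed:   ProtoFailed,
	}
	for in, want := range cases {
		if got := statusToProtoStatus(in); got != want {
			panic(fmt.Sprintf("status %d: got %d, want %d", in, got, want))
		}
	}
	fmt.Println("all mappings ok")
}
```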
if cm.status != status || cm.msg != msg {
	cm.status = status
	cm.msg = msg
	cm.lock.Unlock()
We should test this with concurrent requests. What I fear may occur is that with many concurrent changes of status
we will end up with a different internal state vs the reported state.
Do you see some pros/cons of doing the reporting using a channel, or of having a full-flow lock?
We can also try to 'forget' outdated status updates if we get new ones while we are waiting. This could be implemented using sync.Cond.
My only reason to track the state at this level is so it can be logged when it is changed. We could remove the internal tracking and locking in the Fleet Manager if we just log every time UpdateStatus is called. Then the locking inside of client will handle updating the correct state to the manager.
Or we could place the call to the client inside of the lock to ensure that it's always the same between the two.
I think the client option is also ok. I think logging inconsistent states will cause us more trouble than help us during SDH triaging.
I went with moving the logging and client into the lock.
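A sketch of that resolution, with hypothetical names (statusClient is an assumed interface standing in for the concrete elastic-agent-client, which the thread notes is not actually an interface): the lock is held across the state update, the log line, and the client call, so the reported state can never lag the internal state.

```go
package main

import (
	"fmt"
	"sync"
)

type Status int

// statusClient is a hypothetical stand-in for the elastic-agent-client.
type statusClient interface {
	Status(status Status, msg string)
}

type Manager struct {
	lock   sync.Mutex
	status Status
	msg    string
	client statusClient
}

// UpdateStatus holds the lock across the state change, the log line, and the
// client call, so concurrent updates cannot interleave between them.
func (cm *Manager) UpdateStatus(status Status, msg string) {
	cm.lock.Lock()
	defer cm.lock.Unlock()
	if cm.status == status && cm.msg == msg {
		return // no change, nothing to log or report
	}
	cm.status = status
	cm.msg = msg
	fmt.Printf("Status change to %d: %s\n", status, msg)
	cm.client.Status(status, msg)
}

// recordingClient captures the last reported status, for demonstration.
type recordingClient struct{ last Status }

func (r *recordingClient) Status(s Status, msg string) { r.last = s }

func main() {
	c := &recordingClient{}
	m := &Manager{client: c}
	m.UpdateStatus(2, "degraded")
	fmt.Println(c.last == m.status) // reported state matches internal state
}
```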
libbeat/management/management.go (Outdated)

	CheckRawConfig(cfg *common.Config) error

	// UpdateStatus called when the status of the beat has changed.
	UpdateStatus(status Status, msg string)
How about moving the UpdateStatus method into its own interface? I'd like to push only 'status' reporting to the inputs/outputs and provide a set of helper functions that wrap a 'StatusReporter', merging status changes from multiple subsystems. But in the end we can also introduce the minimal interface in another package when we need it.
I can move it to its own interface; would it also still be in the management.Manager interface?
type Manager interface {
StatusReporter
...
}
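The embedding shown above can be sketched end to end; the Status type and the fake implementation are illustrative, and only the StatusReporter/Manager shape follows the snippet in the thread:

```go
package main

import "fmt"

type Status int

// StatusReporter is the minimal interface that could be pushed down to
// inputs/outputs, per the suggestion above.
type StatusReporter interface {
	UpdateStatus(status Status, msg string)
}

// Manager embeds StatusReporter, so every Manager is also a StatusReporter.
type Manager interface {
	StatusReporter
	// ... the rest of the management methods would go here
}

// fake shows that anything implementing UpdateStatus satisfies StatusReporter.
type fake struct{ last Status }

func (f *fake) UpdateStatus(status Status, msg string) { f.last = status }

func main() {
	var r StatusReporter = &fake{}
	r.UpdateStatus(1, "degraded")
	fmt.Println("StatusReporter satisfied")
}
```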
	cm.logger.Infof("Status change to %s: %s", status, msg)
	return
}
cm.lock.Unlock()
Why unlock before the call to client.Status? If we have a race between the unlock and the Status being sent by two goroutines, how can you tell which one is the newer status update?
Yeah, I was commenting on that as well. I can move the client.Status call inside of the lock; I think that would be the best.
…or statusToProtoStatus.
@michalpristas @urso This is ready for another review.
cm.status = status
cm.msg = msg
cm.lock.Unlock()
cm.client.Status(statusToProtoStatus(status), msg)
Here we update the status on the client, which will result in some action either during check-in or in the watcher after some timeout.
If somebody reports a failure, I think (correct me if my memory fails) we talked about restarting the application immediately; can you make sure this flow is valid?
This flow is valid from the side of the Elastic Agent, as reporting Failed will cause the application to be killed and restarted. This is fully covered in the Elastic Agent unit tests, both in the case that the application returns Failed over the protocol and in the case that the application just crashes.
err = errors.Wrap(err, "could not apply the configuration")
cm.logger.Error(err)
cm.client.Status(proto.StateObserved_FAILED, err.Error())
cm.UpdateStatus(management.Failed, err.Error())
Given all these errors I wonder what 'Failed' means. Is the beat really 'failed' or are we (the overall application with agent) running in a degraded mode because not all configurations could have been applied? Even if we fail here the Beat continues to publish events for existing inputs, right?
Are we mixing the status of the 'operation' with the beat health here? If the operation fails, are we required to restart the Beat?
As stated in the list of statuses, there are Failed and Degraded:
// Degraded is status describing application is degraded.
Degraded
// Failed is status describing application is failed. This status should
// only be used in the case the beat should stop running as the failure
// cannot be recovered.
Failed
In the Degraded state the beat can keep sending events and the agent will not stop the process. Failed is failed: I have completely failed, restart me.
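The distinction can be captured as a compact enum sketch. This is a simplified stand-in (the real list in libbeat/management includes more states, such as Unknown, Starting, and Configuring); the docstrings follow the ones quoted above:

```go
package main

import "fmt"

// Status sketch mirroring the semantics discussed above (simplified).
type Status int

const (
	// Running is status describing application is running.
	Running Status = iota
	// Degraded is status describing application is degraded: the beat keeps
	// sending events and the agent will not stop the process.
	Degraded
	// Failed is status describing application is failed. This status should
	// only be used in the case the beat should stop running as the failure
	// cannot be recovered; the agent will restart the process.
	Failed
)

// String makes status transitions readable in logs.
func (s Status) String() string {
	return [...]string{"Running", "Degraded", "Failed"}[s]
}

func main() {
	fmt.Println(Degraded, Failed) // prints "Degraded Failed"
}
```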
Yeah, I've seen these status types. Is this method run on startup only, or also if the beat is already active? In the latter case the Beat will continue with the old state if there was an error decoding the configuration. Do we want to enforce a 'restart' if the config update operation failed? Will Failed trigger a restart by the Agent?
Are we mixing the status of the update 'operation' with the beat health here? If the operation fails, are we required to restart the Beat?
I'm more or less wondering about the overall semantics of OnConfig and the single errors we can get.
It will run both on startup and while running.
The beat doesn't need to worry about handling the restart on Failed status, the Agent will do that work for it, restart it and bring it back up.
michalpristas left a comment
I think it's good for a first iteration, and the exact form will crystallize once we have at least some healthcheck implemented.
…lastic#19114) * Rename management.ConfigManager to management.Manager, add UpdateStatus to Manager interface. * Update docstring for Failed status. * Add to developer changelog. * Add StatusReporter interface, wrap client.Status in lock. Add tests for statusToProtoStatus. (cherry picked from commit 05c9065)
…19114) (#19172) * Rename management.ConfigManager to management.Manager, add UpdateStatus to Manager interface. * Update docstring for Failed status. * Add to developer changelog. * Add StatusReporter interface, wrap client.Status in lock. Add tests for statusToProtoStatus. (cherry picked from commit 05c9065)
What does this PR do?
Renames management.ConfigManager to management.Manager, being that it now does more than just manage configurations. Adds UpdateStatus to management.Manager to allow a beat to update the status.
Using FleetManager, the status will be propagated over the elastic-agent-client to the controlling agent.
Why is it important?
So status can be reported both to the controlling Agent and up to Fleet for observability of the beats that Agent is running.
Reporting a Failed status to Agent will result in the Agent stopping and restarting the beat.
Checklist
- I have made corresponding changes to the documentation
- I have made corresponding change to the default configuration files
- CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc
How to test this PR locally
Related issues
HealthCheck function in every beats. #17737
Screenshots
Logs