
Separate bulk upload behavior from CollectorSet #19691

Merged
tsullivan merged 36 commits into elastic:master from tsullivan:usage/separate-bulk-behavior-from-collectorset on Jun 21, 2018

Conversation

@tsullivan
Member

@tsullivan tsullivan commented Jun 5, 2018

Part of #19534
Closes #19567

This PR prepares the way for the CollectorSet object to be moved to kbn_server in core and instantiated before the Kibana plugins are.

Changes

  • takes out a lot of behavior from CollectorSet and moves it to a class called BulkUploader
  • simplifies kibana monitoring init by taking out the indirection that startCollectorSet / createCollectorSet had
  • removes start() method from CollectorSet and calls the collector objects' init() functions from CollectorSet.register()
  • removes cleanup method from collectorSet since that was doing work for bulk uploading
  • keeps start/stop methods in the new BulkUploader class
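To make the split concrete, here is a minimal sketch of the resulting shape. This is illustrative only: the interval mechanics and every name other than `BulkUploader`, `start`, and `stop` are assumptions, not the actual implementation.

```javascript
// Hypothetical sketch: BulkUploader owns start/stop and the upload timer,
// while CollectorSet is left to only register collectors.
class BulkUploader {
  constructor({ interval, log = () => {} }) {
    this._interval = interval;
    this._log = log;
    this._timer = null;
  }

  start(collectorSet) {
    // check the timer first so a double start only warns instead of
    // stacking intervals
    if (this._timer) {
      this._log('bulk uploader already started');
      return;
    }
    this._log('Starting monitoring stats collection');
    this._timer = setInterval(() => this._fetchAndUpload(collectorSet), this._interval);
  }

  stop() {
    clearInterval(this._timer);
    this._timer = null;
  }

  _fetchAndUpload(collectorSet) {
    // fetch stats from every registered collector and bulk-upload to ES
  }
}
```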

Testing:

One of the most important things to ensure with this PR is that Kibana monitoring and telemetry still works.

  • For Kibana monitoring, the Kibana instance dashboards should still show up-to-date operational information. Stopping and restarting Elasticsearch should work, meaning that no (new) server errors are logged, and when Elasticsearch starts again, the dashboards keep updating on their own.
  • For telemetry, we must make sure the /api/_xpack/usage HTTP endpoint still shows all the same information as it does in the master branch.

Kibana Ops monitoring resumes even if Elasticsearch was down for a minute (Kibana server did not restart):

@tsullivan tsullivan added WIP Work in progress v6.4.0 Team:Monitoring Stack Monitoring team Feature:Telemetry labels Jun 5, 2018
@tsullivan tsullivan requested a review from pickypg June 5, 2018 23:24
Member

@pickypg pickypg left a comment

Let's see if we can work in the start/stop logic for the collector set (or just an enabled/disabled state).

Member

Is there a reason that this method was added as part of the constructor, while the other two below get called out separately? The old version was its own class method.

Member Author

The only reason is because I don't want this class to have a this.collectorSet, which it would need if start() wasn't able to reach the collectorSet via enclosure. It's not a big deal, just my personal preference.

What do you think about that reason?
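For illustration, the two shapes being weighed in this thread look roughly like the following (class names are hypothetical):

```javascript
// (a) the instance keeps its own reference to the collector set
class UploaderWithRef {
  constructor(collectorSet) {
    this.collectorSet = collectorSet; // instance now owns the reference
  }
  start() {
    return this.collectorSet; // uses the stored reference
  }
}

// (b) the reference is only ever passed in, reaching start() by argument
// (or via closure in the init code), so the instance stays decoupled
class UploaderWithoutRef {
  start(collectorSet) {
    return collectorSet;
  }
}
```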

Member

I'm fine with the reasoning. Perhaps add a comment to note that's why?

Member

This is a dangerous race condition. If Kibana starts up before/while Elasticsearch is starting (or happens to be non-responsive for some reason during this time period), then Kibana monitoring is guaranteed to be disabled until Kibana is restarted. Instead, I think that we would be better off stopping/starting the collector based on this logic (or, continuing to blindly start it). Since we can listen to the xpack_main.info object, we should be safe to do that.

Also, by not returning the collectorSet, we force other plugins to validate the CollectorSet's existence before using it in addition to the monitoring plugin being enabled.

Member Author

In the next version of this code, collectorSet will be moved to src/server/usage in core Kibana. There, it will be instantiated before any of the core or external plugins are. Plugins will access it like:

const { collectorSet } = server.usage;

This part of the code looks like a miss in the refactoring - the CollectorSet constructor only needs server passed in, and it needs that for logging.
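In that future shape, a plugin registering a collector might look like this. This is hypothetical: `server.usage` is the proposed location described above, and the collector type and fields here are illustrative.

```javascript
// Hypothetical registration against the proposed core-owned collectorSet.
function registerMyPluginCollector(server) {
  const { collectorSet } = server.usage;
  collectorSet.register({
    type: 'my_plugin_stats',           // illustrative collector type
    init: () => {},                    // called from CollectorSet.register()
    fetch: async () => ({ count: 0 }), // illustrative stats payload
  });
}
```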

Member Author

collectorSet doesn't need to have conditional instantiation/startup, but the bulk uploader's start method probably does

Member

Since initKibanaMonitoring (as this PR writes it) can return undefined, doesn't this fail hard in that scenario?

Member Author

@tsullivan tsullivan Jun 8, 2018

Per the previous comments on this area, I still need to handle stopping/restarting the uploader when xpack_main plugin goes red/back to green, and check the info before starting again

tsullivan added 9 commits June 8, 2018 10:39
 - takes out a lot of behavior from CollectorSet and moves it to a class called BulkUploader
 - simplifies kibana monitoring init by taking out the indirection that startCollectorSet / createCollectorSet had
 - removes start() method from CollectorSet and calls the collector objects' init() functions from CollectorSet.register()
 - removes cleanup method from collectorSet since that was doing work for bulk uploading
@tsullivan tsullivan force-pushed the usage/separate-bulk-behavior-from-collectorset branch from 45c3f1a to 3046ff0 Compare June 8, 2018 21:58
@tsullivan
Member Author

tsullivan commented Jun 8, 2018

I'm backing off on my comment about breaking out some of the changes into a smaller PR. When I started going that way, the work it took to get those small changes working with no regressions was almost as big as this one.

@tsullivan tsullivan force-pushed the usage/separate-bulk-behavior-from-collectorset branch from 3046ff0 to 486c64e Compare June 8, 2018 23:36
@tsullivan
Member Author

tsullivan commented Jun 8, 2018

@pickypg I updated code per your feedback about needing a start / stop. I only added a stop method to BulkUploader, the CollectorSet doesn't need start / stop methods.

If Elasticsearch is down, we'll have to handle that possibility in the API route handlers that call collectorSet.bulkFetch() using try/catch. Even right now though, most APIs do not respond appropriately when Elasticsearch is down - as you can see from my Client Response Time screenshot.
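The try/catch shape described here would look roughly like the following. The handler signature and reply shape are illustrative, not the actual route code.

```javascript
// Hypothetical route handler guarding collectorSet.bulkFetch() against an
// unavailable Elasticsearch.
async function usageHandler(req, reply, collectorSet, callCluster) {
  try {
    const usage = await collectorSet.bulkFetch(callCluster);
    return reply(usage);
  } catch (err) {
    req.log(['error'], err);
    return reply({ error: 'stats unavailable' }).code(503);
  }
}
```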

@tsullivan tsullivan added review v7.0.0 and removed WIP Work in progress labels Jun 8, 2018
});
} catch(err) {
req.log(['error'], err);
req.log(['error'], err); // FIXME doesn't seem to log anything useful if ES times out
Member Author

The FIXME comment here seems to be the case for any API that relies on ES, but I want to look into it more later

*/
collectorSet.register(getOpsStatsCollector(server));
collectorSet.register(getKibanaUsageCollector(server));
collectorSet.register(getSettingsCollector(server));
Member Author

Registering a collector doesn't need to be delayed on anything. It's only when fetch is called on them that we want to try to make sure Elasticsearch is up. The bulkUploader handles this by watching elasticsearch plugin status. This isn't happening robustly from the API caller yet, but I have some plans to make that better in a later PR.

);
}

monitoringPlugin.status.green('Ready');
Member Author

@tsullivan tsullivan Jun 9, 2018

It was not a good idea for an earlier PR to add this - having the monitoring plugin status always be green was supposed to be a feature

type: KIBANA_STATS_TYPE,
init,
fetch: buffer.flush,
fetchAfterInit: true,
Member Author

@tsullivan tsullivan Jun 9, 2018

I intended to pull the removal of fetchAfterInit and cleanup as properties of the Collector constructor argument object, but getting that working without regressions of its own turned out to be a huge change that would have been temporary.

@elastic elastic deleted a comment from elasticmachine Jun 19, 2018
@elastic elastic deleted a comment from elasticmachine Jun 19, 2018
@elastic elastic deleted a comment from elasticmachine Jun 19, 2018
@elastic elastic deleted a comment from elasticmachine Jun 19, 2018
['info', LOGGING_TAG, KIBANA_MONITORING_LOGGING_TAG],
'Internal collection for Kibana monitoring is disabled per configuration.'
);
} else if (!xpackMainInfo) {
Contributor

server.plugins.xpack_main.info is a service, so if it's missing it doesn't mean that X-Pack info couldn't be fetched, but that the plugin has changed in some way. I think this block can just be deleted.

* @return undefined
*/
start(collectorSet) {
this._log.info('Starting monitoring stats collection');
Contributor

Feels like this should go after the check for this._timer

Member Author

not sure why this comment didn't become outdated /shrug


if (!monitoringBulkEnabled) {
// debug level to not spam the logs
this._log.debug(`Skipping fetch and upload of monitoring stats due to monitoring bulk endpoint not being available`);
Contributor

Any reason we shouldn't call this.stop()?

Member Author

Continuing to check at an interval protects the possibility that the bulk endpoint in ES can be disabled with a dynamic setting (not sure if that is possible), or that the ES configuration changes without triggering an ES plugin state change in Kibana. That seems like it is possible with hot-swapping ES for one with a different configuration.

Member

@tsullivan The endpoint will not change based on the license, nor can it be changed dynamically (the endpoint either exists, or it doesn't because xpack.monitoring.enabled: false is explicitly set). However, calling this.stop() would require the xpack_main plugin to go from green to another state, then back to green for the service to be restarted, which is not a requirement for a license change.

Relative to the next comment, I think that is the superior option and then we can add this logic there.

if (kibanaCollectionEnabled && xpackMainInfo) {
const bulkUploader = initBulkUploader(monitoringPlugin.kbnServer, server, xpackMainInfo);

xpackMainPlugin.status.on('green', async () => { // any time xpack_main turns green
Contributor

Would be great to switch this over to xpackInfo.onLicenseInfoChange() once #20018 is merged

Member

With that listener support, I think we can safely do that and I think we can then back out the license checking code from the BulkUploader itself per collection and put it back into such a listener.
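Sketched out, the listener-driven wiring discussed here might look like this. It assumes the onLicenseInfoChange API from #20018; the license check itself is a stand-in, not a real X-Pack call.

```javascript
// Hypothetical wiring: toggle the uploader on license info changes instead
// of plugin status transitions. isBulkAllowed is an illustrative predicate.
function wireUploaderToLicense(xpackMainInfo, bulkUploader, collectorSet, isBulkAllowed) {
  xpackMainInfo.onLicenseInfoChange(() => {
    if (isBulkAllowed(xpackMainInfo)) {
      bulkUploader.start(collectorSet);
    } else {
      bulkUploader.stop();
    }
  });
}
```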

const data = await collectorSet.bulkFetch(this._callClusterWithInternalUser);
const usableData = data.filter(d => Boolean(d) && !isEmpty(d.result));
const payload = usableData.map(({ result, type }) => {
if (!isEmpty(result)) {
Contributor

This condition is redundant because of the .filter() right?

Member

I think it would also become more readable to just change this to

const payload = data.filter(d => Boolean(d) && !isEmpty(d.result)).map(({ result, type }) => [{ index: { _type: type } }, result]);
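Spread over multiple lines with sample data, that refactor behaves like the following (the isEmpty here is a minimal stand-in for lodash's, just enough for the example):

```javascript
// Minimal stand-in for lodash's isEmpty, enough for this example.
const isEmpty = v => v == null || (typeof v === 'object' && Object.keys(v).length === 0);

const data = [
  undefined,                                      // failed fetch
  { type: 'kibana_stats', result: { count: 1 } }, // usable
  { type: 'kibana_settings', result: {} },        // empty result
];

// The combined filter + map suggested above.
const payload = data
  .filter(d => Boolean(d) && !isEmpty(d.result))
  .map(({ result, type }) => [{ index: { _type: type } }, result]);
// payload: [[{ index: { _type: 'kibana_stats' } }, { count: 1 }]]
```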

@elastic elastic deleted a comment from elasticmachine Jun 20, 2018
@tsullivan
Member Author

unrelated visualization test failure (Vega)

jenkins test this

@tsullivan
Member Author

different unrelated failures this time:

22:07:57    fail: "dashboard app using legacy data dashboard state Directly modifying url updates dashboard state for embeddable config color parameters on a visualization updates a pie slice color on a soft refresh"
22:08:44    fail: "dashboard app using legacy data dashboard state "after all" hook"

@tsullivan
Member Author

jenkins test this

@elastic elastic deleted a comment from elasticmachine Jun 20, 2018
@elastic elastic deleted a comment from elasticmachine Jun 20, 2018
@elasticmachine
Contributor

💚 Build Succeeded

Contributor

@spalger spalger left a comment

One last thing, but otherwise looks good!

monitoringPlugin.status.green('Ready');
// xpack main plugin status goes red if ES client lost connection
xpackMainPlugin.status.on('red', () => {
bulkUploader.handleConnectionLost();
Contributor

I think this is unnecessary and will cause problems because we are stopping the bulkUploader without starting again. Instead I think the bulk uploader should just try to talk to elasticsearch and fail if it can't. If we really want to get rid of the errors while Kibana is a red status we can check xpackMainPlugin.status.state !== 'red' before logging about errors talking to elasticsearch or something, but I suggest avoiding toggling the bulkUploader based on plugin status.

Member Author

stopping the bulkUploader without starting again

It will get restarted when the xpack main plugin status change causes the license info to change, but yeah - that is a pretty indirect signal.

we can check xpackMainPlugin.status.state !== 'red'

I would have liked to not log avoidable errors, but checking the xpack main plugin state feels like something that should be handled by the platform. Earlier in this code, I was passing xpackMainInfo into the bulk uploader constructor, and I really like that I'm not doing that in the latest version.

It sounds like checking the status to avoid logging errors is a nice-to-have. I like the idea of simplifying the code, so I'll just remove this status handler.


@elasticmachine
Contributor

💚 Build Succeeded

@tsullivan tsullivan merged commit d0595fd into elastic:master Jun 21, 2018
@tsullivan tsullivan deleted the usage/separate-bulk-behavior-from-collectorset branch June 21, 2018 20:08
@tsullivan
Member Author

spencer gave verbal lgtm

tsullivan added a commit to tsullivan/kibana that referenced this pull request Jun 21, 2018
* Separate bulk upload behavior from CollectorSet

 - takes out a lot of behavior from CollectorSet and moves it to a class called BulkUploader
 - simplifies kibana monitoring init by taking out the indirection that startCollectorSet / createCollectorSet had
 - removes start() method from CollectorSet and calls the collector objects' init() functions from CollectorSet.register()
 - removes cleanup method from collectorSet since that was doing work for bulk uploading

* remove cleanup and fetchAfterInit methods

* test for bulk_uploader class

* improve test for collector_set

* fix reporting

* expose collectorSet if there actually is a collectorSet

* comment for enclosed function

* make collectorSet creation/expose unconditional, bulkUploader more conditional

* fix collector_set tests

* lifecycle events

* stab at collectorSet error logging from the API call

* clean up comments

* clean up comments

* fix BulkUploader mocha test

* check kibanaCollectionEnabled config before registering bulk upload and the plugin status listeners

* no singleton timer object

* just log a warning if bulk uploader start called twice

* normal quotes

* check if bulk is enabled inside of the _fetchAndUpload method

* log for stopping bulk stats

* call bulkUploader.start with the collectorSet object

* call bulkUploader.start with the collectorSet object

* roll back change for module scoped variable

* oops I broke init

* init and logging: if / elseif / elseif

* remove unnecessary check/log

* help log

* remove redundant, use data.filter.map

* use xpackInfo.onLicenseInfoChange not xpackMainPlugin.status.on('green')

* help logging

* fix unit test

* remove handler that stops upload when connection is lost
tsullivan added a commit that referenced this pull request Jun 21, 2018