Enable analytics geoip in behavioral analytics.#96624
Enable analytics geoip in behavioral analytics.#96624afoucret merged 26 commits intoelastic:mainfrom
Conversation
|
Pinging @elastic/ent-search-eng (Team:Enterprise Search) |
There was a problem hiding this comment.
ℹ️ For all pipeline that have _meta.geoip_database_lazy_download set to false, the download is triggered only when an index with the pipeline set as default_pipeline or final_pipeline exists.
There was a problem hiding this comment.
ℹ️ Now we need to check if an index is created with the pipeline set.
There was a problem hiding this comment.
ℹ️ Adding geoip_database_lazy_download to our pipeline, so the database is downloaded only when an Analytics Collection is created.
There was a problem hiding this comment.
ℹ️ Enable again this test for all versions. not have been disabled.
There was a problem hiding this comment.
ℹ️ Adding a tags field to events as it is used by the geoip processor.
|
Pinging @elastic/es-data-management (Team:Data Management) |
|
Hi @afoucret, I've created a changelog YAML for you. |
jimczi
left a comment
There was a problem hiding this comment.
I like the approach. We shouldn't have to switch ingest pipeline under the hood to avoid the eager download at startup.
I approved for our side so let's wait for @elastic/es-data-management's feedback now.
|
Also asked a review to @eyalkoren cause he was involved in the index template registry stuff |
|
@elasticsearchmachine run elasticsearch-ci/doc-check |
|
@elasticsearchmachine run elasticsearch-ci/part-1 |
|
@afoucret thanks for the trust, though I am really not in a position to review this area as I am still learning it myself 😊 |
There was a problem hiding this comment.
Maybe worth mentioning ingest.geoip.downloader.eager.download? Something like:
If `true` (and if `ingest.geoip.downloader.eager.download` is false), the missing database is downloaded when the pipeline is created. Else, the download is triggered by when the pipeline is used as the `default_pipeline` or `final_pipeline` in an index.
There was a problem hiding this comment.
🙇 Added your change
There was a problem hiding this comment.
Shouldn't this be false? We want to not download on pipeline creation for this test right?
There was a problem hiding this comment.
And related, we probably ought to have a test that sets the value to true and checks that it does download right? Like
putGeoIpPipeline(pipelineId, true);
assertBusy(() -> assertNotNull(getTask().getState()));
There was a problem hiding this comment.
And another good test might be to create an index at this point that does not have a geoip processor in its pipeline, to make sure the cluster state change listener doesn't trigger the download when just any index is created.
There was a problem hiding this comment.
I fixed the test and added few more step to it as suggested.
There was a problem hiding this comment.
Should we check whether it has a geoip processor before potentially looping through all of the indices to see if it's used? I'm a little worried about performance here.
There was a problem hiding this comment.
It might be best to collect the pipelines that have geoip processors together first while noting which of those pipelines have geoip processors that all have the "download on index created" setting set to true. If there are any pipelines that don't have that setting present, we can skip reading the indices and start the download. If they all have the setting, then we can check to see if any indices have a default/final pipeline usage that references one of the noted pipelines.
There was a problem hiding this comment.
I did some change:
- We collect all the pipeline
downlaod_database_on_pipeline_creationbeingtrueand return true if not empty - We collect all the pipeline
download_database_on_pipeline_creationbeingfalseand return false if empty - We loop over the indices to check if one of the collected pipeline is referenced only if we did not fall into an early return case.
There was a problem hiding this comment.
OK just to confirm, I think that the worst case performance is now:
Every time someone adds/modifies/deletes an index or adds/modifies/deletes a pipeline (so fairly often), for each pipeline in the cluster state with a geoip processor with "download_database_on_pipeline_creation" set (probably relatively few), we look through each index in the cluster state (potentially 50k or more) to see if that is the default or final pipeline.
That check will happen more often than I'd like but I think that'll be acceptably fast.
There was a problem hiding this comment.
I think you want the ability to set this to false for the test right? It defaults to true.
There was a problem hiding this comment.
Good catch.
The condition has been updated to:
if (downloadDatabaseOnPipelineCreation == false || randomBoolean()) {
The random boolean allow to randomize testing between download_database_on_pipeline_creation missing or set to true
masseyke
left a comment
There was a problem hiding this comment.
I think that the integration test needs to be updated (I think maybe it just wasn't fully updated after the property name changed). I also have some concerns about possible performance problems in the cluster state change listener.
…an index exists for the pipeline.
…decide database download strategy.
458ecb9 to
c3baf3d
Compare
|
@masseyke I did update the PR according to your feedback. Would be nice if you could re-review it. Thank you. |
| private static boolean hasAtLeastOneGeoipProcessor(List<Map<String, Object>> processors) { | ||
| return processors != null && processors.stream().anyMatch(GeoIpDownloaderTaskExecutor::hasAtLeastOneGeoipProcessor); | ||
| @SuppressWarnings("unchecked") | ||
| private static List<PipelineConfiguration> pipelineConfigurationsWithGeoIpProcessor( |
There was a problem hiding this comment.
It would be good to have some javadocs for these methods. Specifically it would be good documenting what the input args are (the code makes sense when you get into it, but it's not intuitive to me what impact downloadDatabaseOnPipelineCreation has on the pipelineConfigurationsWithGeoIpProcessor that are returned from this method from just looking at the method signature)..
masseyke
left a comment
There was a problem hiding this comment.
It would be great to see some more javadocs in the new methods in GeoIpDownloaderTaskExecutor, but the approach overall seems good to me and I think you've addressed the worst of the performance problems. Thanks for all the work on this.
jbaiera
left a comment
There was a problem hiding this comment.
LGTM, left one small nit but otherwise thank you for iterating!
| return true; | ||
| } | ||
|
|
||
| List<String> checkReferencedPipelines = pipelineConfigurationsWithGeoIpProcessor(clusterState, false).stream() |
There was a problem hiding this comment.
Could this be a Set since all interactions with it are via contains?
There was a problem hiding this comment.
Good idea. I did pushed an update to use a Set
The default pipeline used by behavioral analytics contains a geo ip processor and is installed through an IndexTemplateRegistry.
With the current implementation, it means that the GeoIpDownloader will be run as soon as Elasticsearch server will start.
We do not want this because it is not optimal for users that do not use behavioral analytcis.
This PR add an additional flag optional
geoip_database_lazy_downloadin the pipeline_metato determine if the pipeline install should trigger the download or notIf the flag is missing or set to
falsethe behavior is unchanged.If the flag is set to true, the geoip downloader will be triggered only when an index exists with the pipeline set has
default_pipelineorfinal_pipeline