Enable analytics geoip in behavioral analytics. by afoucret · Pull Request #96624 · elastic/elasticsearch

afoucret · 2023-06-06T15:58:20Z

The default pipeline used by behavioral analytics contains a geo ip processor and is installed through an IndexTemplateRegistry.

With the current implementation, this means that the GeoIpDownloader runs as soon as the Elasticsearch server starts.
We do not want this because it is not optimal for users who do not use behavioral analytics.

This PR adds an additional, optional flag, geoip_database_lazy_download, in the pipeline _meta to determine whether installing the pipeline should trigger the download:

  "processors": [
      { 
         "geoip": {}
      }
  ],
  "_meta": {
      "geoip_database_lazy_download": true
  }

If the flag is missing or set to false, the behavior is unchanged.
If the flag is set to true, the geoip downloader is triggered only when an index exists with the pipeline set as its default_pipeline or final_pipeline.
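
The flag semantics described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the PR's implementation: the class LazyDownloadFlag, its two methods, and the referencedByIndex parameter are hypothetical names, the pipeline is modelled as a plain Map, and the pipeline is assumed to contain a geoip processor.

```java
import java.util.Map;

final class LazyDownloadFlag {

    // Hypothetical helper (not from the PR): true only when
    // _meta.geoip_database_lazy_download is present and set to true.
    static boolean isLazy(Map<String, Object> pipelineConfig) {
        Object meta = pipelineConfig.get("_meta");
        if (meta instanceof Map) {
            return Boolean.TRUE.equals(((Map<?, ?>) meta).get("geoip_database_lazy_download"));
        }
        return false; // no _meta section: behavior unchanged (eager download)
    }

    // Assumes the pipeline contains a geoip processor. referencedByIndex stands for
    // "some index uses this pipeline as its default_pipeline or final_pipeline".
    static boolean shouldTriggerDownload(Map<String, Object> pipelineConfig, boolean referencedByIndex) {
        return isLazy(pipelineConfig) == false || referencedByIndex;
    }
}
```

With this sketch, a pipeline without the flag triggers the download as soon as it is installed, while a pipeline with the flag set to true waits until an index references it.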

elasticsearchmachine · 2023-06-06T15:58:44Z

Pinging @elastic/ent-search-eng (Team:Enterprise Search)

afoucret · 2023-06-09T09:02:31Z

...s/ingest-geoip/src/main/java/org/elasticsearch/ingest/geoip/GeoIpDownloaderTaskExecutor.java

ℹ️ For all pipelines that have _meta.geoip_database_lazy_download set to true, the download is triggered only when an index with the pipeline set as default_pipeline or final_pipeline exists.

afoucret · 2023-06-09T09:03:20Z

...s/ingest-geoip/src/main/java/org/elasticsearch/ingest/geoip/GeoIpDownloaderTaskExecutor.java

ℹ️ Now we need to check if an index is created with the pipeline set.

afoucret · 2023-06-09T09:05:11Z

.../org/elasticsearch/xpack/entsearch/analytics/behavioral_analytics-events-final_pipeline.json

ℹ️ Adding geoip_database_lazy_download to our pipeline, so the database is downloaded only when an Analytics Collection is created.

afoucret · 2023-06-09T09:06:24Z

x-pack/qa/rolling-upgrade/src/test/java/org/elasticsearch/upgrades/GeoIpUpgradeIT.java

ℹ️ Re-enable this test for all versions. It should not have been disabled.

afoucret · 2023-06-09T09:06:33Z

...ources/org/elasticsearch/xpack/entsearch/analytics/behavioral_analytics-events-mappings.json

ℹ️ Adding a tags field to events as it is used by the geoip processor.

elasticsearchmachine · 2023-06-09T10:00:53Z

Pinging @elastic/es-data-management (Team:Data Management)

elasticsearchmachine · 2023-06-09T10:00:53Z

Hi @afoucret, I've created a changelog YAML for you.

jimczi

I like the approach. We shouldn't have to switch ingest pipeline under the hood to avoid the eager download at startup.
I approved for our side so let's wait for @elastic/es-data-management's feedback now.

afoucret · 2023-06-09T10:08:38Z

I also requested a review from @eyalkoren because he was involved in the index template registry work.

afoucret · 2023-06-09T10:09:30Z

@elasticsearchmachine run elasticsearch-ci/doc-check

afoucret · 2023-06-09T10:09:54Z

@elasticsearchmachine run elasticsearch-ci/part-1

eyalkoren · 2023-06-11T13:25:40Z

@afoucret thanks for the trust, though I am really not in a position to review this area as I am still learning it myself 😊
Someone from @elastic/es-data-management would be a better fit.

masseyke · 2023-06-12T12:48:21Z

docs/reference/ingest/processors/geoip.asciidoc

Maybe worth mentioning ingest.geoip.downloader.eager.download? Something like:
If `true` (and if `ingest.geoip.downloader.eager.download` is false), the missing database is downloaded when the pipeline is created. Else, the download is triggered when the pipeline is used as the `default_pipeline` or `final_pipeline` in an index.

🙇 Added your change

masseyke · 2023-06-12T15:23:40Z

...est-geoip/src/internalClusterTest/java/org/elasticsearch/ingest/geoip/GeoIpDownloaderIT.java

Shouldn't this be false? We want to not download on pipeline creation for this test right?

And related, we probably ought to have a test that sets the value to true and checks that it does download right? Like

putGeoIpPipeline(pipelineId, true);
assertBusy(() -> assertNotNull(getTask().getState()));

And another good test might be to create an index at this point that does not have a geoip processor in its pipeline, to make sure the cluster state change listener doesn't trigger the download when just any index is created.

I fixed the test and added a few more steps to it, as suggested.

masseyke · 2023-06-12T16:19:08Z

...s/ingest-geoip/src/main/java/org/elasticsearch/ingest/geoip/GeoIpDownloaderTaskExecutor.java

Should we check whether it has a geoip processor before potentially looping through all of the indices to see if it's used? I'm a little worried about performance here.

It might be best to collect the pipelines that have geoip processors together first while noting which of those pipelines have geoip processors that all have the "download on index created" setting set to true. If there are any pipelines that don't have that setting present, we can skip reading the indices and start the download. If they all have the setting, then we can check to see if any indices have a default/final pipeline usage that references one of the noted pipelines.

I made some changes:

We collect all the pipelines with download_database_on_pipeline_creation set to true, and return true if that collection is not empty.

We collect all the pipelines with download_database_on_pipeline_creation set to false, and return false if that collection is empty.

We loop over the indices to check whether one of the collected pipelines is referenced, but only if we did not hit one of the early-return cases.

OK just to confirm, I think that the worst case performance is now:
Every time someone adds/modifies/deletes an index or adds/modifies/deletes a pipeline (so fairly often), for each pipeline in the cluster state with a geoip processor with "download_database_on_pipeline_creation" set (probably relatively few), we look through each index in the cluster state (potentially 50k or more) to see if that is the default or final pipeline.
That check will happen more often than I'd like but I think that'll be acceptably fast.
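
The three steps discussed in this thread can be sketched roughly as follows. This is a simplified illustration, not the code in GeoIpDownloaderTaskExecutor: GeoIpDownloadDecision, IndexPipelines, and shouldDownload are hypothetical names, the cluster state is replaced by plain collections, and each pipeline is assumed to carry a single download_database_on_pipeline_creation value.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class GeoIpDownloadDecision {

    /** Hypothetical stand-in for the default/final pipeline settings of one index. */
    static final class IndexPipelines {
        final String defaultPipeline; // may be null when unset
        final String finalPipeline;   // may be null when unset

        IndexPipelines(String defaultPipeline, String finalPipeline) {
            this.defaultPipeline = defaultPipeline;
            this.finalPipeline = finalPipeline;
        }
    }

    /**
     * @param geoIpPipelines pipeline id -> download_database_on_pipeline_creation,
     *                       containing only pipelines that have a geoip processor
     * @param indices        pipeline settings of every index (assumed input shape)
     */
    static boolean shouldDownload(Map<String, Boolean> geoIpPipelines, List<IndexPipelines> indices) {
        // Step 1: any pipeline that downloads on creation means "download now".
        if (geoIpPipelines.containsValue(Boolean.TRUE)) {
            return true;
        }
        // Step 2: collect the pipelines that defer the download; none means nothing to do.
        Set<String> lazyPipelines = new HashSet<>();
        for (Map.Entry<String, Boolean> entry : geoIpPipelines.entrySet()) {
            if (Boolean.FALSE.equals(entry.getValue())) {
                lazyPipelines.add(entry.getKey());
            }
        }
        if (lazyPipelines.isEmpty()) {
            return false;
        }
        // Step 3: only now scan the indices (the potentially expensive part).
        for (IndexPipelines index : indices) {
            if (lazyPipelines.contains(index.defaultPipeline) || lazyPipelines.contains(index.finalPipeline)) {
                return true;
            }
        }
        return false;
    }
}
```

The index scan in step 3 is the cost raised in the comment above: it runs only when every geoip pipeline defers its download, and each lookup against the collected pipeline ids is a hash lookup.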

masseyke · 2023-06-12T17:32:49Z

...est-geoip/src/internalClusterTest/java/org/elasticsearch/ingest/geoip/GeoIpDownloaderIT.java

I think you want the ability to set this to false for the test right? It defaults to true.

Good catch.
The condition has been updated to:

if (downloadDatabaseOnPipelineCreation == false || randomBoolean()) {

The random boolean allows the test to randomize between download_database_on_pipeline_creation being missing and being set to true.
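
To make the effect of that condition concrete, here is a small stand-alone sketch. It is not the test helper from GeoIpDownloaderIT: the class and method names are hypothetical, a plain Map replaces the real pipeline builder, the "field" value is a placeholder, and java.util.Random stands in for the test framework's randomBoolean().

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

final class GeoIpProcessorConfigSketch {

    // Builds a geoip processor config. The setting is always written when it is false;
    // when it is true it is written only about half of the time, so the "missing"
    // case (which defaults to true) is exercised as well.
    static Map<String, Object> geoIpProcessorConfig(boolean downloadDatabaseOnPipelineCreation, Random random) {
        Map<String, Object> config = new HashMap<>();
        config.put("field", "ip"); // placeholder field name, assumed for illustration
        if (downloadDatabaseOnPipelineCreation == false || random.nextBoolean()) {
            config.put("download_database_on_pipeline_creation", downloadDatabaseOnPipelineCreation);
        }
        return config;
    }
}
```

Because a missing setting and an explicit true are expected to behave identically, the test can pick either form at random without changing what it asserts.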

masseyke

I think that the integration test needs to be updated (I think maybe it just wasn't fully updated after the property name changed). I also have some concerns about possible performance problems in the cluster state change listener.

afoucret · 2023-06-13T18:19:56Z

@masseyke I updated the PR according to your feedback. It would be nice if you could re-review it. Thank you.

masseyke · 2023-06-15T19:06:40Z

...s/ingest-geoip/src/main/java/org/elasticsearch/ingest/geoip/GeoIpDownloaderTaskExecutor.java

-    private static boolean hasAtLeastOneGeoipProcessor(List<Map<String, Object>> processors) {
-        return processors != null && processors.stream().anyMatch(GeoIpDownloaderTaskExecutor::hasAtLeastOneGeoipProcessor);
+    @SuppressWarnings("unchecked")
+    private static List<PipelineConfiguration> pipelineConfigurationsWithGeoIpProcessor(


It would be good to have some javadocs for these methods. Specifically, it would be good to document what the input args are (the code makes sense when you get into it, but just from looking at the method signature it's not intuitive to me what impact downloadDatabaseOnPipelineCreation has on the pipeline configurations that are returned from pipelineConfigurationsWithGeoIpProcessor).

masseyke

It would be great to see some more javadocs in the new methods in GeoIpDownloaderTaskExecutor, but the approach overall seems good to me and I think you've addressed the worst of the performance problems. Thanks for all the work on this.

jbaiera

LGTM, left one small nit but otherwise thank you for iterating!

jbaiera · 2023-06-15T20:21:08Z

...s/ingest-geoip/src/main/java/org/elasticsearch/ingest/geoip/GeoIpDownloaderTaskExecutor.java

+            return true;
+        }
+
+        List<String> checkReferencedPipelines = pipelineConfigurationsWithGeoIpProcessor(clusterState, false).stream()


Could this be a Set since all interactions with it are via contains?

Good idea. I pushed an update to use a Set.

afoucret added the >non-issue, :EnterpriseSearch/Application, Team:Enterprise Search, and v8.9.0 labels Jun 6, 2023

afoucret commented Jun 9, 2023

jimczi added the :Distributed/Ingest Node, Team:Data Management, and >feature labels and removed the >non-issue label Jun 9, 2023

jimczi approved these changes Jun 9, 2023

afoucret requested a review from eyalkoren June 9, 2023 10:08

afoucret removed the :EnterpriseSearch/Application label Jun 9, 2023

elasticsearchmachine removed the Team:Enterprise Search label Jun 9, 2023

masseyke assigned masseyke and unassigned masseyke Jun 9, 2023

eyalkoren removed their request for review June 11, 2023 13:25

masseyke reviewed Jun 12, 2023

masseyke requested changes Jun 12, 2023

afoucret requested a review from masseyke June 13, 2023 16:18

When using a managed pipeline GeoIpDownloader is triggered only when an index exists for the pipeline.

823e276

Aurelien FOUCRET and others added 14 commits June 13, 2023 18:21

lint

7ee7a2e

Adding an integration tests for managed pipelines.

c8158fc

lint

051d5d4

Add a geoip_database_lazy_download param to pipelines and use it instead of managed.

c257f07

Fix a edge case: pipeline can be set after index is created.

9a3a079

lint.

abb80c0

Update docs/changelog/96624.yaml

f44caf6

Update 96624.yaml

728c18f

Uses a processor setting (download_database_on_pipeline_creation) to decide database download strategy.

199ccd2

Removing debug instruction.

e4db7aa

Improved documentation.

ca4e909

Improved the way to check for referenced pipelines.

62a6ab8

Fixing an error in test.

ef123f7

Improved integration tests.

c3baf3d

afoucret force-pushed the enable-analytics-geip branch from 458ecb9 to c3baf3d Compare June 13, 2023 16:22

Lint.

6489976

Aurelien FOUCRET added 2 commits June 15, 2023 09:06

Fix failing tests.

6ade089

Fix failing tests (2).

90e65fa

masseyke reviewed Jun 15, 2023

masseyke approved these changes Jun 15, 2023

Aurelien FOUCRET added 2 commits June 15, 2023 22:04

Adding javadoc.

9dc3f81

lint javadoc.

31082be

jbaiera approved these changes Jun 15, 2023

Aurelien FOUCRET and others added 2 commits June 15, 2023 22:34

Using a set instead of a list to store checked pipelines.

22aa720

Merge branch 'elastic:main' into enable-analytics-geip

022a6ec

afoucret merged commit dd1d157 into elastic:main Jun 15, 2023

afoucret deleted the enable-analytics-geip branch June 5, 2025 06:46

masseyke mentioned this pull request Jul 14, 2025

Correctly handling download_database_on_pipeline_creation within a pipeline processor within a default or final pipeline #131236

Merged
