Skip to content

Add geoip processor#14208

Merged
martijnvg merged 1 commit intoelastic:feature/ingestfrom
talevy:ingest/geoip
Nov 3, 2015
Merged

Add geoip processor#14208
martijnvg merged 1 commit intoelastic:feature/ingestfrom
talevy:ingest/geoip

Conversation

@talevy
Copy link
Copy Markdown
Contributor

@talevy talevy commented Oct 20, 2015

Introduces a GeoIP processor. This processor takes in an ip-address and returns city level information
like city_name, lon/lat, country_name, etc.

TODO items:

  • finalize a configuration strategy for where to keep the geo database
  • figure out security policy around file permissions for the db and GeoLite reflection
  • test depends on local access to the GeoLite2-City.mmdb database located at /tmp/geolite2city.mmdb. an actual testing strategy needs to be implemented.

Currently, The GeoLite API requires the use of reflection and access to a class type which is not
supported by the security policy:

reason: uses reflection in a nested fashion when serializing the response json object into a CityResponse

CityResponse response = dbReader.city(ipAddress);

exception:

java.lang.IllegalArgumentException: Can not access public com.maxmind.geoip2.model.CityResponse() (from class com.maxmind.geoip2.model.CityResponse; failed to set access: access denied ("java.lang.reflect.ReflectPermission" "suppressAccessChecks")

@talevy talevy added the :Distributed/Ingest Node Execution or management of Ingest Pipelines label Oct 20, 2015
@talevy talevy added review and removed review labels Oct 21, 2015
@martijnvg martijnvg self-assigned this Oct 28, 2015
@martijnvg martijnvg added the WIP label Oct 28, 2015
@martijnvg martijnvg changed the title [WIP] new processor: geoip Add geoip processor Oct 28, 2015
@martijnvg
Copy link
Copy Markdown
Member

@talevy I've updated the PR with the following changes:

  • Rebased with the feature/ingest branch and applied plugin security. The ingest plugin will allow that the geoip2 library uses reflection to deserialise the raw data into the geoip2's own domain classes.
  • Temporarily added the city data base to the source tree...
  • Custom database files can only be provided via the config directory. In this directory ES has sufficient permissions to read files from. I believe this is the only sane thing we can do for users with custom geoip databases.
  • Replaced the Java integration test with a rest integration test. mvn clean verify passes now.

Things to do:

  • I think we should add another project where we maintain the geoip2 city database. We can then just pull in the database as an dependency in the elasticsearch project. The database is then available for unit tests, integration tests and when the plugin gets bundled without hacking around and adding custom config to maven or the integration test ant xml stuff.
  • Add unit tests.
  • Also support the geoip2 country database.
  • Each geoip processor instance will have its own instance of the geoip2 database. We should prevent this and I think there should be an service where we keep the database instance around per file.

Questions:

  • There is an todo to add a cache for the database lookups. Is this really something we should do? Reading from the code the database file gets memory mapped into memory so it should quick once it is in the filesystem cache. Did you experience any slowness or maybe have measurements? I didn't yet run any tests, but I will do that later on. Maybe just the initial lookups are slow?
  • There is a todo to add the ability to filter out fields retrieved from the geoip database. Instead of adding this feature, maybe users should configure the mutate processor to remove the unwanted fields?

@talevy
Copy link
Copy Markdown
Contributor Author

talevy commented Oct 28, 2015

In response to the Questions:

  • That cache todo for database lookups is just a note to explore the effects of caching. You can see comments about the improved performance of an LRU Cache for logstash-filter-geoip here: WIP Lru cache logstash-plugins/logstash-filter-geoip#40.
  • I am ok with removing that TODO about adding ability to filter out fields. This is a feature of the logstash filter, so I didn't want that feature to be forgotten when implementing the processor. I agree, it can be taken care of by a mutate.

@martijnvg
Copy link
Copy Markdown
Member

That cache todo for database lookups is just a note to explore the effects of caching.

Right, I see, lets then open an issue to explore this.

@martijnvg
Copy link
Copy Markdown
Member

@talevy I moved the database files to an external dependency and the files are now maintained here: https://github.com/elastic/geolite2-databases

The project is already pushed to the maven repository, so it should be available for everyone.

During plugin installation I put the files under $ES_HOME/config/ingest/geoip/instead of embedding the default database files into the jar file. The benefit of this is that both custom db files and the default db files will be handled by the same logic (no exceptions in the logic).

@martijnvg
Copy link
Copy Markdown
Member

@talevy I've updated this PR:

  • Rebased with the feature/geoip branch and made it work with gradle.
  • Removed the Environment usage from the builder and builder factory. The processors shouldn't rely on any ES classes. Add an extra method to the processor builder factory that allows the environment the ingest framework is running in (ES now and later LS) to pass down the configuration directory.
  • Added a service that is private to the geoip processor that builds and caches database readers, so that multiple geoip processors use the same database reader.
  • Added basic docs and some more tests.

I think it is ready.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this the same as the patterns_directory? User may not have permissions to access it from an arbitrary path?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, nevermind. it gets resolved within the config directory. I see. should I do the same for grok patterns then?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, i think the grok processor should have its own director in the
'config/ingest' folder like geoip?

On Friday, October 30, 2015, Tal Levy notifications@github.com wrote:

In
plugins/ingest/src/main/java/org/elasticsearch/ingest/processor/geoip/GeoProcessor.java
#14208 (comment)
:

  •    public void setDatabaseFile(String dbPath) {
    
  •        this.databaseFile = dbPath;
    
  •    }
    
  •    public void setTargetField(String targetField) {
    
  •        this.targetField = targetField;
    
  •    }
    
  •    public void fromMap(Map<String, Object> config) {
    
  •        this.ipField = (String) config.get("ip_field");
    
  •        String targetField = (String) config.get("target_field");
    
  •        if (targetField != null) {
    
  •            this.targetField = targetField;
    
  •        }
    
  •        String databaseFile = (String) config.get("database_file");
    

oh, nevermind. it gets resolved within the config directory. I see. should
I do the same for grok patterns then?


Reply to this email directly or view it on GitHub
https://github.com/elastic/elasticsearch/pull/14208/files#r43502657.

Met vriendelijke groet,

Martijn van Groningen

@martijnvg
Copy link
Copy Markdown
Member

@talevy I've merged feature/ingest branch into this PR, removed the the maven config and added the annotation.

@talevy
Copy link
Copy Markdown
Contributor Author

talevy commented Nov 3, 2015

LGTM

…on to documents based on an ip address.

The information is fetched from the Maxmind geolite2 database, that is embedded in the ingest plugin.
@martijnvg martijnvg merged commit 0338726 into elastic:feature/ingest Nov 3, 2015
@talevy talevy deleted the ingest/geoip branch November 3, 2015 05:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed/Ingest Node Execution or management of Ingest Pipelines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants