
[WIP] Introduce the GrokProcessor #14132

Merged
talevy merged 1 commit into elastic:feature/ingest from talevy:ingest/grok on Nov 3, 2015

Conversation

@talevy
Contributor

@talevy talevy commented Oct 15, 2015

Also moved all processor classes into a subdirectory and introduced a
ConfigException class as a catch-all for errors that can occur when
constructing new processors from their configurations. The GrokProcessor
loads patterns from the resources directory.
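To illustrate the catch-all idea described above, here is a hedged Python sketch: the class names mirror the PR's Java classes, but the shape is purely illustrative, not the actual implementation. Anything that goes wrong while building a processor from its config is wrapped in a single ConfigException.

```python
import re

class ConfigException(Exception):
    """Raised when a processor cannot be constructed from its config."""

class GrokProcessor:
    def __init__(self, config):
        try:
            self.field = config["field"]                  # KeyError if missing
            self.pattern = re.compile(config["pattern"])  # re.error if malformed
        except (KeyError, re.error) as e:
            raise ConfigException("bad grok processor config: %s" % e)
```

Callers then only need to handle one exception type, regardless of which part of the configuration was invalid.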

Running your first Grok Pipeline

  1. pull changes from this PR

  2. launch a single node with the ingest plugin using the IngestRunner

    cd plugins/ingest
    mvn exec:java -Dexec.mainClass="IngestRunner" -Dexec.classpathScope="test"
  3. Find a log file you wish to parse

    # file: logs
    83.109.8.216 [19/Jul/2015:08:13:42 +0000]
    90.149.9.302 [19/Jul/2015:08:13:44 +0000]
    ...
    
  4. Create your desired pipeline in Elasticsearch

    # file: put_pipeline.py
    import requests

    requests.put("http://localhost:9200/_ingest/pipeline/my_pipeline_id", json={
        "description": "simple_pipeline",
        "processors": [{
            "grok": {
                "field": "message",
                "pattern": r'%{IPORHOST:clientip} \[%{HTTPDATE:timestamp}\]'
            }
        }]
    })
  5. Use the elasticsearch python client to ingest your logs

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch()

    # parse each line into a valid JSON document
    def parse_file(f):
        for line in f:
            yield {'message': line.strip()}

    # read the `logs` file and index it into Elasticsearch
    with open("logs", "r") as f:
        for ok, result in helpers.streaming_bulk(es, parse_file(f), index="test", doc_type="test", params={"ingest": "my_pipeline_id"}):
            action, result = result.popitem()
            # print the response
            print(result)
  6. Your documents should be parsed and ready for searching within Elasticsearch
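The grok expression from step 4 can also be sanity-checked offline with plain Python regexes before creating the pipeline. This is a hedged sketch: `compile_grok` is a hypothetical helper, and the `IPORHOST` / `HTTPDATE` entries below are simplified stand-ins for the real grok pattern definitions, not the ones the plugin ships.

```python
import re

# simplified stand-ins for the real grok pattern bank
PATTERN_BANK = {
    "IPORHOST": r"\d{1,3}(?:\.\d{1,3}){3}|[\w.-]+",
    "HTTPDATE": r"\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4}",
}

def compile_grok(expression):
    # expand each %{PATTERN:name} into a named regex capture group
    regex = re.sub(
        r"%\{(\w+):(\w+)\}",
        lambda m: "(?P<%s>%s)" % (m.group(2), PATTERN_BANK[m.group(1)]),
        expression,
    )
    return re.compile(regex)

grok = compile_grok(r"%{IPORHOST:clientip} \[%{HTTPDATE:timestamp}\]")
doc = grok.match("83.109.8.216 [19/Jul/2015:08:13:42 +0000]").groupdict()
print(doc)  # clientip and timestamp extracted as separate fields
```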

@clintongormley clintongormley added the :Distributed/Ingest Node Execution or management of Ingest Pipelines label Oct 15, 2015
@talevy talevy force-pushed the ingest/grok branch 2 times, most recently from a813799 to f782041 on October 16, 2015 10:13
@talevy talevy force-pushed the ingest/grok branch 5 times, most recently from 9a22b72 to 27d1cc2 on October 20, 2015 09:10
@talevy talevy added the review label Oct 21, 2015
Member

Stream is unfortunately Java 8 only. Since we are likely to backport ingest to 2.x too, we should try to avoid Java-8-only constructs and APIs.

But instead of fixing this, I think we can just remove this method, as it is not used for now?

@martijnvg
Member

This looks great! I know it is WIP, but I left a couple of comments.

Member

the grok field can be final too

Contributor Author

Debugging why this is flaky.

Value of doc.get("val") sometimes equals 123.42 and sometimes equals 123.41999816894531.
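The two values in that comment are the classic signature of single-precision rounding. A quick sketch (independent of the PR's code) reproduces the second one: 123.42 has no exact 32-bit float representation, so a round-trip through single precision widens back to 123.41999816894531 as a double.

```python
import struct

# pack 123.42 as a 32-bit float, then unpack it back to a Python double
as_float32 = struct.unpack('<f', struct.pack('<f', 123.42))[0]
print(as_float32)  # 123.41999816894531
```

Flakiness like this usually means the value takes different paths through float vs. double handling depending on test ordering or serialization.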

Member

s/HashMap<String, Object> fields/Map<String, Object> fields

Contributor Author

ok

@martijnvg
Member

Left a couple more comments. I think we should also add docs for the grok processor to the ingest.asciidoc file, containing a minimal description and example and describing the possible named patterns that can be used.

Member

What if we just load the grok expressions only from the config dir (ES_HOME/config/ingest/grok)? We just make sure that when we create the distribution we also package the ingest config into the zip file. Instead of loading stuff from the classpath we can get it via the Environment class (check the geoip PR to see how it is injected there).

This has as a nice side effect that users can also define their own named grok expressions without us adding extra code.
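The directory-loading idea sketched above could look roughly like this. A hedged sketch: `load_pattern_bank` and the one-"NAME regex"-pair-per-line file format are assumptions for illustration, not the PR's actual Java code.

```python
import os
import tempfile

def load_pattern_bank(directory):
    """Read every file in the grok config dir into a name -> regex dict."""
    bank = {}
    for fname in sorted(os.listdir(directory)):
        with open(os.path.join(directory, fname)) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue  # skip blanks and comments
                name, _, pattern = line.partition(" ")
                bank[name] = pattern
    return bank

# simulate ES_HOME/config/ingest/grok with a temporary directory
grok_dir = tempfile.mkdtemp()
with open(os.path.join(grok_dir, "custom-patterns"), "w") as f:
    f.write("# my patterns\nMYIP \\d{1,3}(\\.\\d{1,3}){3}\n")

bank = load_pattern_bank(grok_dir)
```

Because the loader just walks the directory, users get custom named patterns by dropping a file in, with no extra code.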

Member

(we just load all files in the ES_HOME/config/ingest/grok directory)

Contributor Author

I am confused... currently the geoip processor stores the database under resources/config and falls back on using the classpath to fetch the file.

Contributor Author

ok, I think I got it. I followed what jvm-example does to manage its config files. An additional assemblies task is executed in Maven to copy the config files into config/ingest at runtime, so that this is the correct relative path from the environment: https://github.com/elastic/elasticsearch/pull/14132/files#diff-77d47a95d3d1f49700f95a7daff92e13R42

@talevy
Contributor Author

talevy commented Oct 28, 2015

@martijnvg Similar to your comment about reusing the same GeoIP instance across GeoProcessors: do you think the same should be done with Grok? That is, use the same Grok instance across all grok processors to avoid re-loading the same configs each time a processor is created?

UPDATE: NEVERMIND

@talevy talevy force-pushed the ingest/grok branch 2 times, most recently from b3cc633 to 3a00cc8 on October 28, 2015 23:12
Member

I don't think we should make the patterns dir configurable. Outside the ES_HOME directory, ES has insufficient permissions to read files. I think the patterns dir should always be $ES_HOME/config/ingest/grok/patterns.

@talevy talevy force-pushed the ingest/grok branch 4 times, most recently from a51728f to 1485785 on October 29, 2015 23:47
Member

I don't think the custom task is needed if all resources are placed under src/main/packaging/config.

@rjernst added logic that bundles custom files placed in src/main/packaging into the plugin zip file: https://github.com/elastic/elasticsearch/blob/master/buildSrc/src/main/groovy/org/elasticsearch/gradle/plugin/PluginBuildPlugin.groovy#L97

Member

Yes, please use src/main/packaging!

@martijnvg
Member

@talevy left two minor comments; other than that, LGTM.

Also moved all processor classes into a subdirectory and introduced a
ConfigException class to be a catch-all for things that can go wrong
when constructing new processors with configurations that possibly throw
exceptions. The GrokProcessor loads patterns from the resources
directory.

fix resource path issue, and add rest-api-spec test for grok

fix rest-spec tests

changes: license, remove configexception, throw IOException

add more tests and fix iso8601-hour pattern

move grok patterns from resources to config

fix tests with pom changes, updated IngestClientIT with grok processor

update gradle build script for grok deps and test configuration

move config files to src/main/packaging

move Env out of Processor, fix test for src/main/packaging change

add docs

clean up test resources task

update Grok to be immutable

- Updated the Grok class to be immutable. This means that all the
  pattern bank loading is handled by an external utility class called
  PatternUtils.
- fixed tabs in the nagios patterns file's comments
talevy added a commit that referenced this pull request Nov 3, 2015
[WIP] Introduce the GrokProcessor
@talevy talevy merged commit 6e99c71 into elastic:feature/ingest Nov 3, 2015
@talevy talevy deleted the ingest/grok branch November 3, 2015 05:32
