Ingest Node - enrich data as it gets in via pipelines #14049

[[currently open issues](https://github.com/elastic/elasticsearch/pulls?q=is%3Aopen+is%3Apr+label%3A%22%3Aingest%22)]

related issues from other projects:
- Beats: https://github.com/elastic/beats/issues/805, Merged! Hurray!
- Kibana: https://github.com/elastic/kibana/issues/5974

There are many use-cases where it is important to enrich incoming data. This enrichment may be something simple like using a regular expression to extract metadata from an existing field, or something more advanced like a geoip lookup or language identification. The filter stage of the [Logstash processing pipeline](https://www.elastic.co/guide/en/logstash/current/pipeline.html#_filters) provides great examples of the ways in which data is often enriched. Node ingest implements a new type of ES node, which performs this enrichment prior to indexing.

Node ingest is a pure Java implementation of the filters in Logstash, integrated with Elasticsearch. It works by wrapping the bulk/index APIs and executing a pipeline composed of multiple processors that enrich the documents. A processor is just a component that can modify incoming documents (the source, before ES turns it into a document). A pipeline is a list of processors grouped under a unique id. If node ingest is enabled, the index and bulk APIs can route a request and its documents through a pipeline.
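To make the pipeline/processor relationship concrete, here is a minimal sketch in Python. It is illustrative only: the real implementation is in Java, and every name here (`Pipeline`, `set_field`, `lowercase`, `index`, the `my-pipeline` id) is made up for this example rather than taken from the ingest code.

```python
from typing import Any, Callable, Dict, List, Optional

Source = Dict[str, Any]
# A processor is any callable that mutates the incoming source in place.
Processor = Callable[[Source], None]


class Pipeline:
    """A list of processors grouped under a unique id."""

    def __init__(self, pipeline_id: str, processors: List[Processor]) -> None:
        self.id = pipeline_id
        self.processors = processors

    def execute(self, source: Source) -> Source:
        # Processors run in order, each seeing the previous one's changes.
        for processor in self.processors:
            processor(source)
        return source


def set_field(field: str, value: Any) -> Processor:
    """Hypothetical processor: set a field to a fixed value."""
    def processor(source: Source) -> None:
        source[field] = value
    return processor


def lowercase(field: str) -> Processor:
    """Hypothetical processor: lowercase an existing string field."""
    def processor(source: Source) -> None:
        source[field] = str(source[field]).lower()
    return processor


# In-memory pipeline registry, keyed by pipeline id.
pipelines = {
    "my-pipeline": Pipeline("my-pipeline", [set_field("env", "prod"), lowercase("level")]),
}


def index(source: Source, pipeline: Optional[str] = None) -> Source:
    """Stand-in for the index API: enrich first if a pipeline is requested."""
    if pipeline is not None:
        source = pipelines[pipeline].execute(source)
    return source  # the real API would continue with normal indexing here
```

Calling `index({"level": "WARN"}, pipeline="my-pipeline")` runs both processors before the document would be indexed; without the `pipeline` argument the source passes through untouched.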

The ingest plugin runs on dedicated client nodes. After bulk and index requests have been enriched, they continue their way into the cluster.

Node ingest will be a plugin in the elasticsearch project, implementing 2 main aspects:

The first is a pure Java implementation of Pipeline and Processor, as well as initial processor implementations for grok, geoip, kv/mutate, and date. This Java implementation can then be reused in other places, such as Logstash itself, the reindex API, and so on. In the first version of the ingest plugin the processor implementations can reside in the ingest plugin, but the framework and processor implementations shouldn't rely on any ES-specific code, so that they can later be moved to an isolated library.

The second part is the integration with Elasticsearch. This includes interception of the bulk/index APIs, management APIs (stats and so on, in a future phase), storage and live reload of the configuration, support for multiple "live" pipelines, and simulation of pipeline execution.

The goal of the ingest plugin is to make data enrichment easier; it is not meant to replace Logstash. It should simplify enrichment in most cases where events are only stored in Elasticsearch. For example, when only Filebeat is used to ship logs, a Logstash instance will no longer be required. When events are sent to multiple outputs, a Logstash installation is still required. At some point Logstash will also reuse the pipeline/processor framework, so the end goal is that both Elasticsearch and Logstash benefit from the ingest initiative.

Development happens in a feature branch: https://github.com/elastic/elasticsearch/tree/feature/ingest

Current node ingest tasks:
- [x] Hook into the index and bulk APIs. If the ingest plugin is enabled a `pipeline_id` parameter is available to select what pipeline should be used to preprocess the documents before the index/bulk APIs get executed. #13941
- [x] Manage pipeline configuration. Pipelines are stored as documents in an index. Each ingest node keeps the pipelines in memory so they are available when needed. A background process makes sure that an ingest node will eventually pick up modifications. #13941
- [x] The pipeline document enrichment should be non-blocking and happen via a dedicated thread pool. #13990
- [x] Add first version of CRUD pipeline APIs. #14047
- [x] Data substructure manipulation. Processors should be able to introduce new nested fields with no pre-existing parent structure. For example, it should be possible to create a new field "location.lat" without "location" existing. #14250
- [x] Add grok processor. #14132
- [x] Add geoip processor. #14208
- [x] Add date processor. #14184 
- [x] Add kv/mutate processor. #14253 
- [x] Strict configuration validation #14552
- [x] geoip processor output fields should be configurable #14582
- [x] Add simulate API. This allows pipelines to be tested before actually being used. The API accepts a pipeline definition and actual documents, and returns the transformed documents, optionally showing how each document is modified after each processor. #14572
- [x] Do not fail whole bulk request if any pipeline for a single document fails #14888
- [x] Split mutate processor into separate processors for each function (e.g. update, remove, etc) #14938
- [x] Add support for setting nested fields in document #14250
- [x] Data and metadata manipulation.  #14644
- [x] Processors and factories should throw exceptions on the interface.
- [x] Throw exception when grok expression does not match #15132 
- [x] Add ability to provide custom patterns within Grok Processor config definition #15167 
- [x] Reduce number of fields operated on by processors to just one. #15133 
- [x] Support for ingest transient metadata #15036
- [x] on failure pipeline handler #14548 / #15565 (tal)
- [x] append processor #14324 #15577
- [x] Ingest nodes should update pipelines in a sync manner #14998 / #15203 (mvg)
- [x] support for templating in any processor that sets a field value #14990 #15415 (mvg)
- [x] Add support for ingest node boolean flag to enable/disable ingest at the node level. If a node with `node.ingest` set to false receives an ingest request, it should explicitly fail. (mvg) #15610
- [x] Figure out if geoip2 library can be used without suppressing jvm access checks. (mvg) https://github.com/maxmind/GeoIP2-java/pull/52
- [x] Ingest forks a thread for each bulk item in a bulk request. Instead ingest should use one thread to process an entire bulk request. (mvg) #15593 
- [x] Ingest should only try to load the pipelines if the `.ingest` index has been started. (mvg) #15203
- [x] Cut over to a non-Guice structure. We should at most bind one class to an instance directly rather than use the dep-injection framework that we have to un-do once we get rid of Guice. #15203 (mvg)
- [x] Change `pipeline_id` param name to `pipeline`. #15618
- [x] Add index template for the .ingest index, which should be installed by default. #15001 #15631
- [x] Move ingest infrastructure to core
- [x] Move processors with no dependencies to core
- [x] Make grok a module as it has external deps, but it will be installed by default
- [x] Make geoip a plugin which needs to be installed manually
- [x] If node ingest has been disabled then it should redirect a bulk/index request that has the ingest parameter to a node that has ingest enabled.
- [x] Add more descriptive error messages to pipeline factory exceptions (tal) #16010
- [x] In addition to the isolated processor unit tests we should also test combination of processors. #15247
- [x] Benchmark ingest plugin. #14425 (mvg&tal) 
- [x] Move DedotProcessor into its own plugin #16322 
- [x] Add processor `tag`s to `on_failure` metadata #16202 
- [x] Documentation before release #16009
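
Two of the tasks above are easy to show in a few lines: creating a nested field when the parent does not exist yet (#14250), and not failing a whole bulk request when the pipeline fails for a single document (#14888). The Python below is only a rough sketch of the behaviour, not the actual Java code; `set_path`, `run_bulk` and the result shape are hypothetical names chosen for this example.

```python
from typing import Any, Callable, Dict, List

Source = Dict[str, Any]
Processor = Callable[[Source], None]


def set_path(source: Source, path: str, value: Any) -> None:
    """Set a dotted path such as "location.lat", creating missing parents."""
    *parents, leaf = path.split(".")
    node = source
    for key in parents:
        # Create the intermediate object if it is not there yet.
        child = node.setdefault(key, {})
        if not isinstance(child, dict):
            raise TypeError(f"cannot descend into non-object field '{key}'")
        node = child
    node[leaf] = value


def run_bulk(docs: List[Source], processors: List[Processor]) -> List[Dict[str, Any]]:
    """Run the processors over each document; a failure only marks that item."""
    results = []
    for doc in docs:
        try:
            for processor in processors:
                processor(doc)
            results.append({"ok": True, "source": doc})
        except Exception as exc:  # per-item failure, the rest of the bulk continues
            results.append({"ok": False, "error": str(exc)})
    return results
```

With this sketch, `set_path({}, "location.lat", 52.37)` produces `{"location": {"lat": 52.37}}` on the passed-in dict, and a bulk of two documents where the second one makes a processor raise yields one successful item and one failed item instead of an error for the whole request.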

possible v2 tasks:
- Make it possible to let the ingest plugin know what pipeline to use via index settings / index template.
- configuration management
- _simulate w/ verbose should show document diffs between processor results #14698 
- Composite pipelines. It would be convenient to pre-define certain pipelines that process specific things that can be re-used for other documents. For example, there may be a pipeline that processes date + geoip, and these operate on fields that are common to other documents that may require further processing.
- Add ability to show stats in _simulate response to show resource usage and execution times of pipelines/processors
- A pipeline should be able to choose what (custom) thread it uses. Some pipelines just do some simple modifications to the incoming documents while others may reach out to external systems to enrich the incoming documents. #14616
- Grok discover API #15041
- Add notion of transactions for processors which mutate multiple fields within a document. This will allow `on_failure` processors to receive a document with a pre-failed-processor state (ref: https://github.com/elastic/elasticsearch/issues/14548#issuecomment-161799133)
- Add compare processor. #14647
- Add json processor, which parses a JSON string into a structured JSON object.
- Add kv processor.
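
One possible shape for the composite pipelines idea above is a processor that simply delegates to another pipeline by id. This is a speculative Python sketch to help the discussion, not a proposed design; `pipeline_ref`, `rename`, `uppercase`, `run` and the pipeline ids are all invented for the example.

```python
from typing import Any, Callable, Dict, List

Source = Dict[str, Any]
Processor = Callable[[Source], None]

# Hypothetical registry of pipelines: id -> ordered list of processors.
registry: Dict[str, List[Processor]] = {}


def pipeline_ref(pipeline_id: str) -> Processor:
    """Processor that runs another registered pipeline on the same source."""
    def processor(source: Source) -> None:
        # Looked up at execution time, so the referenced pipeline can be reloaded.
        for inner in registry[pipeline_id]:
            inner(source)
    return processor


def rename(old: str, new: str) -> Processor:
    def processor(source: Source) -> None:
        source[new] = source.pop(old)
    return processor


def uppercase(field: str) -> Processor:
    def processor(source: Source) -> None:
        source[field] = str(source[field]).upper()
    return processor


# A shared pipeline, reused by a more specific one.
registry["common"] = [rename("ts", "@timestamp")]
registry["apache"] = [pipeline_ref("common"), uppercase("verb")]


def run(pipeline_id: str, source: Source) -> Source:
    for processor in registry[pipeline_id]:
        processor(source)
    return source
```

Running `run("apache", {"ts": "2016-01-01", "verb": "get"})` first applies the shared `common` pipeline and then the `apache`-specific processor. The sketch does not guard against pipelines that reference each other in a cycle, which a real design would need to handle.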

