Ingest Node - enrich data as it gets in via pipelines #14049

@martijnvg

[currently open issues]

related issues from other projects:

There are many use-cases where it is important to enrich incoming data. This enrichment may be something simple like using a regular expression to extract metadata from an existing field, or something more advanced like a geoip lookup or language identification. The filter stage of the Logstash processing pipeline provides great examples of the ways in which data is often enriched. Node ingest implements a new type of ES node, which performs this enrichment prior to indexing.

Node ingest is a pure Java implementation of the filters in Logstash, integrated with Elasticsearch. It works by wrapping the bulk/index APIs and executing a pipeline, composed of multiple processors, to enrich the documents. A processor is just a component that can modify incoming documents (the source, before ES turns it into a document). A pipeline is a list of processors grouped under a unique id. If node ingest is enabled, the index and bulk APIs can reroute a request and its documents through a pipeline.
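The processor/pipeline relationship described above can be sketched in a few lines of plain Java. This is only an illustration under assumptions: the names `Processor`, `Pipeline`, `RegexExtractProcessor` and `PipelineSketch`, the `execute`/`run` signatures, and the representation of the source as a `Map<String, Object>` are all hypothetical and are not taken from the feature branch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical contract: a processor modifies the incoming source in place. */
interface Processor {
    void execute(Map<String, Object> source);
}

/** A pipeline is an ordered list of processors grouped under a unique id. */
final class Pipeline {
    private final String id;
    private final List<Processor> processors;

    Pipeline(String id, List<Processor> processors) {
        this.id = id;
        this.processors = new ArrayList<>(processors); // defensive copy
    }

    String id() {
        return id;
    }

    /** Runs every processor, in order, against the same source map. */
    Map<String, Object> run(Map<String, Object> source) {
        for (Processor processor : processors) {
            processor.execute(source);
        }
        return source;
    }
}

/** Illustrative processor: copies the first regex capture group into a new field. */
final class RegexExtractProcessor implements Processor {
    private final String sourceField;
    private final String targetField;
    private final Pattern pattern;

    RegexExtractProcessor(String sourceField, String regex, String targetField) {
        this.sourceField = sourceField;
        this.targetField = targetField;
        this.pattern = Pattern.compile(regex);
    }

    @Override
    public void execute(Map<String, Object> source) {
        Object value = source.get(sourceField);
        if (!(value instanceof String)) {
            return; // nothing to extract from
        }
        Matcher matcher = pattern.matcher((String) value);
        if (matcher.find()) {
            source.put(targetField, matcher.group(1));
        }
    }
}

class PipelineSketch {
    /** Builds a two-processor pipeline and runs one log line through it. */
    static Map<String, Object> demo() {
        Processor extractStatus =
            new RegexExtractProcessor("message", "\\s(\\d{3})$", "status");
        // Second processor depends on the field written by the first one.
        Processor statusClass = doc -> {
            Object status = doc.get("status");
            if (status != null) {
                doc.put("status_class", status.toString().charAt(0) + "xx");
            }
        };
        Pipeline pipeline = new Pipeline("access-logs", Arrays.asList(extractStatus, statusClass));

        Map<String, Object> source = new HashMap<>();
        source.put("message", "GET /index.html 200");
        return pipeline.run(source);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the processors share one mutable source map, later processors can build on fields written by earlier ones, which is why the order of the list matters in this sketch.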

The ingest plugin runs on dedicated client nodes. After bulk and index requests have been enriched, they continue their way into the cluster.

Node ingest will be a plugin in the Elasticsearch project, implementing two main aspects:

The first is a pure Java implementation of Pipeline and Processor, as well as initial processor implementations of grok, geoip, kv/mutate, and date. This Java implementation can then be reused in other places, such as Logstash itself, the reindex API, and so on. In the first version of the ingest plugin the processor implementations can reside in the ingest plugin, but the framework and processor implementations shouldn’t rely on any ES-specific code, so that later on they can be moved to an isolated library.

The second part is the integration with Elasticsearch. This includes interception of the bulk/index APIs, management APIs (stats and so on, in a future phase), storage and live reload of the configuration, support for multiple "live" pipelines, and simulation of pipeline execution.
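As a rough illustration of what "multiple live pipelines", "live reload" and "simulation" could mean in practice, here is a minimal in-memory sketch. It is an assumption-laden toy, not the plugin's design: `PipelineStore`, `put`, `execute`, `simulate` and `StoreSketch` are hypothetical names, processors are modeled as plain `Consumer<Map<String, Object>>` values to keep the block self-contained, and real configuration storage inside the cluster is not modeled at all.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** Hypothetical registry holding several "live" pipelines keyed by id. */
final class PipelineStore {
    private final Map<String, List<Consumer<Map<String, Object>>>> live =
        new ConcurrentHashMap<>();

    /** Adds a pipeline, or replaces the one with the same id (live reload). */
    void put(String id, List<Consumer<Map<String, Object>>> processors) {
        live.put(id, Collections.unmodifiableList(new ArrayList<>(processors)));
    }

    /** Runs the pipeline against the given source, modifying it in place. */
    Map<String, Object> execute(String id, Map<String, Object> source) {
        List<Consumer<Map<String, Object>>> processors = live.get(id);
        if (processors == null) {
            throw new IllegalArgumentException("unknown pipeline: " + id);
        }
        for (Consumer<Map<String, Object>> processor : processors) {
            processor.accept(source);
        }
        return source;
    }

    /** Simulation: runs the pipeline on a shallow copy, leaving the input untouched. */
    Map<String, Object> simulate(String id, Map<String, Object> source) {
        return execute(id, new HashMap<>(source));
    }
}

class StoreSketch {
    /** Returns {simulated result, original input, result after reload}. */
    static List<Map<String, Object>> demo() {
        PipelineStore store = new PipelineStore();
        Consumer<Map<String, Object>> tagProd = doc -> doc.put("env", "prod");
        Consumer<Map<String, Object>> tagStaging = doc -> doc.put("env", "staging");

        store.put("logs", Collections.singletonList(tagProd));

        Map<String, Object> input = new HashMap<>();
        input.put("message", "hello");
        Map<String, Object> simulated = store.simulate("logs", input);

        // Replacing the pipeline under the same id takes effect for the next request.
        store.put("logs", Collections.singletonList(tagStaging));
        Map<String, Object> afterReload = store.execute("logs", new HashMap<>(input));

        List<Map<String, Object>> results = new ArrayList<>();
        results.add(simulated);
        results.add(input);
        results.add(afterReload);
        return results;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Swapping the whole processor list under one id, rather than mutating it, is one simple way to let in-flight requests finish on the old definition while new requests see the reloaded one.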

The goal of the ingest plugin is to make data enrichment easier; it is not meant to replace Logstash. In most cases where events are only stored in Elasticsearch, the ingest plugin should make enrichment easier. For example, when only Filebeat is used to ship logs, a Logstash instance will no longer be required. In cases where events are stored in multiple outputs, a Logstash installation is still required. Also, at some point Logstash will reuse the pipeline/processor framework, so the end goal is that both Elasticsearch and Logstash benefit from the ingest initiative.

Development happens in a feature branch: https://github.com/elastic/elasticsearch/tree/feature/ingest

Current node ingest tasks:

possible v2 tasks:
