Change stream.* fields to dataset.* fields

The [new indexing strategy](https://github.com/elastic/kibana/blob/master/docs/ingest_manager/index.asciidoc#indexing-strategy) currently uses the fields `stream.type`, `stream.dataset`, and `stream.namespace`. Over the last few weeks it has become apparent that these field names might not be optimal, so the proposal is to change them to `dataset.type`, `dataset.name`, and `dataset.namespace`.
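To make the proposed rename concrete, here is a minimal sketch of the field mapping applied to a flat, dotted-key document. The helper name `rename_stream_fields` and the sample document are hypothetical and only illustrate the mapping listed above; they are not part of any existing tool.

```python
# Mapping taken from the proposal: old stream.* field -> new dataset.* field.
STREAM_TO_DATASET = {
    "stream.type": "dataset.type",
    "stream.dataset": "dataset.name",
    "stream.namespace": "dataset.namespace",
}


def rename_stream_fields(doc: dict) -> dict:
    """Return a copy of a flat (dotted-key) document with stream.* renamed to dataset.*.

    Hypothetical helper for illustration; keys outside the mapping are kept as-is.
    """
    return {STREAM_TO_DATASET.get(key, key): value for key, value in doc.items()}


# Illustrative document, not real output of any component.
old_doc = {
    "stream.type": "logs",
    "stream.dataset": "nginx.access",
    "stream.namespace": "default",
    "message": "GET /",
}
new_doc = rename_stream_fields(old_doc)
```

Note that only the field names change; the values stay the same.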

Note: This issue is filed in the package registry because the registry currently enforces these fields and is public, but many other places will need to be updated if we move forward with this.

# What is the problem with stream.* fields?

* stream is Agent specific: The name `stream.*` initially came out of building the Elastic Agent configuration, where we have inputs with streams and each stream goes to a single dataset. But anyone can use the new indexing strategy, so it should not be tied to a specific technology.
* It is more than a stream: Proposed values for `stream.type` also can be content which is not necessarily a stream. See also https://github.com/elastic/ecs/pull/845
* Talking about dataset as a whole: When talking about the indexing strategy, I realised I often talk about the dataset name and all of it as one dataset. A dataset is a set of data which belongs together. It is uniquely defined by the type, name and namespace. For example, `logs-nginx.access-default` and `logs-nginx.access-prod` are two different datasets.
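The idea that a dataset is identified by the triple of type, name and namespace can be sketched as follows. The `Dataset` class and its `identifier` method are hypothetical names for illustration; the `{type}-{name}-{namespace}` pattern is inferred from the `logs-nginx.access-default` example above.

```python
from typing import NamedTuple


class Dataset(NamedTuple):
    """Hypothetical model of a dataset: identity is the (type, name, namespace) triple."""

    type: str
    name: str
    namespace: str

    def identifier(self) -> str:
        # Pattern inferred from the example `logs-nginx.access-default`.
        return f"{self.type}-{self.name}-{self.namespace}"


default_ds = Dataset(type="logs", name="nginx.access", namespace="default")
prod_ds = Dataset(type="logs", name="nginx.access", namespace="prod")
```

Because the namespace is part of the identity, `default_ds` and `prod_ds` compare as different datasets even though they share a type and a name.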

Based on the above I came to the conclusion that `dataset` should be an object and used for the indexing strategy fields.

One alternative that was discussed is using `datastream` instead, as each `dataset` is stored in a datastream. But not every datastream is a dataset per this definition, and the name would again tie the fields to a specific technology implementation.

The other alternative discussed was using existing ECS fields like `event.kind` and `event.dataset`. This does not work because the types are different (`constant_keyword`) and because we will be even stricter about names than these fields currently are. The idea is still that these fields will be closely linked in their possible values.

# Benefits of dataset.*

Using `dataset.*` also solves some existing problems:

* `stream.*` conflicts with an existing docker input field in Filebeat, which is a keyword.
* Decoupling of `input.type` in the Elastic Agent config from `dataset.type`. Even if the `input.type` is `log`, the `dataset.type` could be `metrics` if the log file contains metrics.
* Removes confusion inside the agent config between `stream` and `streams`.
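The first point, the conflict between a keyword field and an object of the same name, can be illustrated with a small sketch. It assumes the existing Filebeat field is a top-level `stream` holding a string such as `stdout`; the helper `set_dotted` is hypothetical and only mimics the fact that a field cannot be both a scalar and an object with sub-fields.

```python
def set_dotted(doc: dict, path: str, value) -> None:
    """Set a dotted field path in a nested document.

    Hypothetical helper: refuses to nest sub-fields under a key that
    already holds a scalar, which is the kind of conflict described above.
    """
    *parents, leaf = path.split(".")
    node = doc
    for key in parents:
        child = node.setdefault(key, {})
        if not isinstance(child, dict):
            raise TypeError(
                f"{key!r} already holds a scalar ({child!r}); cannot nest {path!r} under it"
            )
        node = child
    node[leaf] = value


# Assumed shape of an event that already carries a keyword `stream` field.
event = {"stream": "stdout"}

try:
    set_dotted(event, "stream.type", "logs")
    conflict = False
except TypeError:
    conflict = True

# The dataset.* prefix does not collide with the existing field.
set_dotted(event, "dataset.type", "logs")
```

With `dataset.*`, the existing `stream` keyword stays untouched and the new fields live in their own object.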

# Changes needed

Places where the current `stream.*` implementation needs to change:

* Elastic Agent field enrichment
* Endpoint binary field enrichment
* Package registry field validation
* Package registry base package with templates
* Integrations repository field addition
* Integrations repository modules export script for dashboard filters
* Indexing strategy docs

This change will likely have no impact on the UI side.

