Continuous transform with date histogram generates duplicates after index rollover

**Elasticsearch version** (`bin/elasticsearch --version`): 7.9.0

**JVM version** (`java -version`): bundled 14.0.1

**Description of the problem including expected versus actual behavior**:

In an ongoing project, we use continuous transforms with a `date_histogram` in the pivot to aggregate metrics from raw data over time. We also use ILM policies to roll over the target index and remove old metrics data.

We noticed that, after a target index rollover, the transform generates duplicate records for the same timestamp and the same values of the other pivot 'dimensions'. This behaviour is incorrect: no such duplicates should be generated.

Analysis of the logs and source code shows that:

1. The transform implementation causes `date_histogram` to produce a bucket for the interval that intersects the right bound of the time range processed by the current checkpoint. Because source documents are filtered by that time range, the aggregate for this interval is incomplete. Such 'incomplete' records are then inserted into the target index.

2. The transform implementation also rounds the left bound of the processed time range down to the nearest `date_histogram` interval when computing aggregates. This causes 'incomplete' records produced by one checkpoint to be overwritten by 'complete' ones generated by the next checkpoint, leading to multiple upserts per checkpoint.

3. When an index rollover occurs between two checkpoints, these upserts become inserts into the newly created index (via the write alias), leading to data duplication.
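The three points above can be illustrated with a small simulation. This is only a sketch of the described behaviour, not the actual transform code: the index names `agg_data-000001` / `agg_data-000002`, the checkpoint bounds, and all function names are hypothetical, and timestamps are plain seconds.

```python
INTERVAL = 60  # 1m fixed interval, in seconds


def bucket_start(ts, interval=INTERVAL):
    """Round a timestamp down to the start of its histogram bucket."""
    return ts - ts % interval


def buckets_for_checkpoint(lower, upper, interval=INTERVAL):
    """Bucket starts written by a checkpoint that processes [lower, upper).

    The left bound is rounded down to the interval (point 2); the right
    bound is not, so the last bucket may be only partially covered (point 1).
    """
    start = bucket_start(lower, interval)
    return list(range(start, upper, interval))


def simulate(checkpoints, rollover_before):
    """Return {bucket_start: set of indices that received a document for it}.

    Checkpoints numbered >= rollover_before write to the new index, which
    models the write alias moving after a rollover (point 3).
    """
    written = {}
    for n, (lower, upper) in enumerate(checkpoints):
        # Hypothetical index names behind the write alias.
        index = "agg_data-000002" if n >= rollover_before else "agg_data-000001"
        for bucket in buckets_for_checkpoint(lower, upper):
            written.setdefault(bucket, set()).add(index)
    return written


# Two consecutive checkpoints whose shared bound (150) is not interval-aligned.
checkpoints = [(0, 150), (150, 270)]
written = simulate(checkpoints, rollover_before=1)
duplicates = sorted(b for b, indices in written.items() if len(indices) > 1)
print(duplicates)  # [120]
```

The bucket starting at 120 is written once as an incomplete record by the first checkpoint and again as a complete record by the second. Without a rollover, the second write would overwrite the first; with a rollover in between, it lands in a different index and both copies remain.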

**Steps to reproduce**:

1. Set up the Elasticsearch artifacts as described in [setup_artifacts.txt](https://github.com/elastic/elasticsearch/files/5130448/setup_artifacts.txt)

2. Extract the provided [data_generator.zip](https://github.com/elastic/elasticsearch/files/5130453/data_generator.zip) and run the data generation script (requires Python 3 and [elasticsearch-py](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/index.html)):
`python3 data_generator.py`

3. Create the index patterns `raw_data-*` and `agg_data-*` in Kibana for easy visualization. Switch to the `Discover` view and select the `agg_data-*` index pattern.

4. Wait for an index rollover and the subsequent transform execution / checkpoint. Notice the significant increase in the number of documents in one `1m` interval near the rollover timestamp (it will be shifted into the past due to the transform frequency and delay).
![01_count_anomaly](https://user-images.githubusercontent.com/30018591/91316616-4e7cf700-e7b9-11ea-938e-5c8ee8f5fa56.png)

5. Explore the `1m` interval with the increased document count. Notice that nearly all `{@timestamp, x, y}` combinations occur twice in this interval, with different `_index` field values. Documents created in the newer index always have a `count` greater than or equal to that of the duplicated document in the older index, due to the upserts mentioned in the problem description.
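The same check can be done outside Kibana on a list of search hits. The helper below is a hypothetical sketch: it assumes each hit carries `_index` plus a `_source` containing `@timestamp`, `x`, `y` and `count`, as in the reproduction data, and the sample hits and index names are made up for illustration.

```python
from collections import defaultdict


def find_duplicates(hits):
    """Group hits by (@timestamp, x, y); keep keys present in more than one index."""
    by_key = defaultdict(list)
    for hit in hits:
        src = hit["_source"]
        key = (src["@timestamp"], src["x"], src["y"])
        by_key[key].append((hit["_index"], src["count"]))
    return {
        key: docs
        for key, docs in by_key.items()
        if len({index for index, _ in docs}) > 1
    }


# Made-up sample hits; index names are assumptions.
hits = [
    {"_index": "agg_data-000001",
     "_source": {"@timestamp": "2020-08-26T10:02:00Z", "x": 1, "y": 2, "count": 3}},
    {"_index": "agg_data-000002",
     "_source": {"@timestamp": "2020-08-26T10:02:00Z", "x": 1, "y": 2, "count": 7}},
    {"_index": "agg_data-000002",
     "_source": {"@timestamp": "2020-08-26T10:03:00Z", "x": 1, "y": 2, "count": 5}},
]

for key, docs in find_duplicates(hits).items():
    print(key, docs)
```

Only the `10:02` combination is reported, with the lower `count` in the older index and the higher one in the newer index.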
![02_record_duplicates](https://user-images.githubusercontent.com/30018591/91316645-5b014f80-e7b9-11ea-969a-59ff8b65d430.png)

6. For a more in-depth analysis, enable `trace` logging for `org.elasticsearch.xpack.transform`. Capture and examine the logs of transform executions around the index rollover. Notice how the generated queries first produce 'incomplete' records and then attempt to overwrite them with complete data during the next execution. Cross-check the logs against the source code.
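One way to enable this logger at runtime is the cluster update settings API with a `logger.<package>` key. The snippet below only builds and prints the request so that it can be pasted into the Kibana console; the helper name is hypothetical, and using a `transient` setting is an assumption about what is convenient for a short debugging session.

```python
import json


def transform_logging_settings(level="trace"):
    """Build a cluster settings body that sets the transform logger level.

    Passing level=None produces a JSON null, which resets the logger
    to its default level.
    """
    return {"transient": {"logger.org.elasticsearch.xpack.transform": level}}


body = transform_logging_settings()
print("PUT /_cluster/settings")
print(json.dumps(body, indent=2))
```

Once the logs are captured, calling `transform_logging_settings(None)` gives the body that resets the level.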

**Proposed solution**:

My approach would be to eliminate 'incomplete' records from the whole process. This should solve the duplication issue and also prevent the upserts that currently occur for every checkpoint. One way to achieve this is to ensure that the right bound of the time range processed within a checkpoint is rounded down to the nearest `date_histogram` interval.
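The idea can be sketched in a few lines. This is an illustration of the rounding only, not the experimental provider itself: the function names are hypothetical, timestamps are plain seconds, and only fixed intervals are considered.

```python
def align_down(ts, interval):
    """Round a timestamp down to the nearest multiple of the interval."""
    return ts - ts % interval


def aligned_checkpoints(upper_bounds, interval, start=0):
    """Turn raw checkpoint upper bounds into interval-aligned [lower, upper) ranges.

    A checkpoint whose aligned upper bound does not advance past the
    previous one is skipped, since it would contain no complete bucket.
    """
    ranges = []
    lower = start
    for raw_upper in upper_bounds:
        upper = align_down(raw_upper, interval)
        if upper > lower:
            ranges.append((lower, upper))
            lower = upper
    return ranges


def buckets(lower, upper, interval):
    """Bucket starts produced for [lower, upper), left bound rounded down."""
    return list(range(align_down(lower, interval), upper, interval))


ranges = aligned_checkpoints([150, 270], interval=60)
per_checkpoint = [buckets(lo, hi, 60) for lo, hi in ranges]
print(ranges)          # [(0, 120), (120, 240)]
print(per_checkpoint)  # [[0, 60], [120, 180]]
```

With aligned bounds, every bucket is produced by exactly one checkpoint, so a rollover between checkpoints can no longer split one bucket across two indices. The trade-off is that data in the trailing partial interval becomes visible only once that interval is complete.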

As an experiment, I have implemented a custom `CheckpointProvider` that produces checkpoints aligned to a given interval. I have modified `TransformService` so that this `IntervalBasedCheckpointProvider` is instantiated, using the histogram's interval, whenever a `date_histogram` is present in the pivot definition. The current simplified implementation supports only fixed intervals.

Initial results are encouraging: no duplicates are generated when running the reproduction scenario. It also seems to work correctly for our project's metrics aggregation transforms. However, adding a new `CheckpointProvider` seems somewhat excessive and possibly not entirely correct. Maybe introducing fine-grained modifications in the query building logic of `TransformIndexer` and related classes would be a better idea?

Since my experience with Elasticsearch source code is limited, any feedback / suggestions would be greatly appreciated.


Continuous transform with date histogram generates duplicates after index rollover #61587
