Elasticsearch Version
8.7 and above
Installed Plugins
No response
Java Version
bundled
OS Version
All
Problem Description
This issue has been observed while integrating Prometheus remote writes into a time series database (TSDB). It happens as a result of batching taking place before metrics are written to TSDB: if we have two batches (possibly with other batches in between), writing the second batch might fail due to TSDB duplicate detection when the first one had the same set of dimensions (names and values) and the same timestamp. This happens because of the way we generate "the primary key" for documents stored in TSDB.
Each document is identified by the pair (_tsid, timestamp), so clients can generate batches that include metrics with the same _tsid and timestamp. Note also that clients might pre-aggregate metrics on the timestamp field, which increases the chance of duplicates because documents are generated at specific points in time.
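To make the collision concrete, here is a minimal sketch (not Elasticsearch's actual _tsid encoding, just an illustrative key in the same `name=value` format shown below) of how a key built only from dimensions plus timestamp ignores the metric fields entirely:

```python
def tsid(doc, dimension_fields):
    # Join dimension name=value pairs in a stable order, mimicking the
    # "region=...:host=...:pod=..." shape of the _tsid described below.
    return ":".join(f"{k}={doc[k]}" for k in sorted(dimension_fields))

dims = ["region", "host", "pod"]
doc1 = {"@timestamp": "2021-04-28T18:00:00.000Z", "region": "us-east-2",
        "host": "foo", "pod": "fb58e236-48af-11ee-be56-0242ac120002", "gauge": 10}
doc2 = {"@timestamp": "2021-04-28T18:00:00.000Z", "region": "us-east-2",
        "host": "foo", "pod": "fb58e236-48af-11ee-be56-0242ac120002", "counter": 10}

key1 = (tsid(doc1, dims), doc1["@timestamp"])
key2 = (tsid(doc2, dims), doc2["@timestamp"])

# gauge vs counter plays no part in the key, so the two keys collide:
assert key1 == key2
```

Since `gauge` and `counter` are metric fields rather than dimensions, they never reach the key, which is exactly why the second batch trips duplicate detection.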
Example
Batch 1
...
{"@timestamp": "2021-04-28T18:00:00.000Z" , "region": "us-east-2", "host": "foo", "pod": "fb58e236-48af-11ee-be56-0242ac120002", "gauge": 10}
...
Batch 2
...
{"@timestamp": "2021-04-28T18:00:00.000Z" , "region": "us-east-2", "host": "foo", "pod": "fb58e236-48af-11ee-be56-0242ac120002", "counter": 10}
...
Here we have two metrics, gauge and counter, whose dimensions (names and values) are the same.
The _tsid for the first batch is the same as the _tsid of the second batch, something similar to:
region=us-east-2:host=foo:pod=fb58e236-48af-11ee-be56-0242ac120002
Also the timestamp is the same.
The result is that duplicate detection kicks in, rejecting the second batch as a duplicate.
Integrations work around this issue by adding the metric name as an additional "dummy" dimension field. In that case the gauge and counter metrics have different names, which results in different _tsid values, which in turn results in two documents being written with different _tsid (and the same timestamp). Adding this extra dimension (the metric name), however, means two documents are stored instead of one, resulting in TSDB storage overhead.
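The workaround can be sketched with the same illustrative key function as above (again, not the real _tsid encoding; `metric_name` is a hypothetical field name for the "dummy" dimension):

```python
def tsid(doc, dimension_fields):
    # Illustrative key in the "name=value:name=value" shape of the _tsid above.
    return ":".join(f"{k}={doc[k]}" for k in sorted(dimension_fields))

base = {"region": "us-east-2", "host": "foo",
        "pod": "fb58e236-48af-11ee-be56-0242ac120002"}
ts = "2021-04-28T18:00:00.000Z"

# Promote the metric name to a dimension, as the integrations do:
doc_gauge = {**base, "metric_name": "gauge", "gauge": 10}
doc_counter = {**base, "metric_name": "counter", "counter": 10}

dims = ["region", "host", "pod", "metric_name"]
key_gauge = (tsid(doc_gauge, dims), ts)
key_counter = (tsid(doc_counter, dims), ts)

# Duplicate detection no longer fires, at the cost of storing two
# documents where one would do:
assert key_gauge != key_counter
```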
This is considered a bug since we reject data as a result of incorrect duplicate detection. The two batches include data about two different metrics even if they share the same set of dimensions (_tsid). The contract of metric protocols, such as Prometheus remote write or OTLP metrics is that the metric name is part of the identity of a time series.
Steps to Reproduce
Just try to index two documents with the same dimensions, each with a different set of metrics, for instance one with a counter and one with a gauge.
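A minimal reproduction along these lines might look like the following console requests (index name, routing path, and time bounds are illustrative; field names match the example documents above):

```
PUT /tsdb-repro
{
  "settings": {
    "index.mode": "time_series",
    "index.routing_path": ["region", "host", "pod"],
    "index.time_series.start_time": "2021-04-28T00:00:00Z",
    "index.time_series.end_time": "2021-04-29T00:00:00Z"
  },
  "mappings": {
    "properties": {
      "@timestamp": {"type": "date"},
      "region": {"type": "keyword", "time_series_dimension": true},
      "host": {"type": "keyword", "time_series_dimension": true},
      "pod": {"type": "keyword", "time_series_dimension": true},
      "gauge": {"type": "double", "time_series_metric": "gauge"},
      "counter": {"type": "double", "time_series_metric": "counter"}
    }
  }
}

POST /tsdb-repro/_doc
{"@timestamp": "2021-04-28T18:00:00.000Z", "region": "us-east-2", "host": "foo", "pod": "fb58e236-48af-11ee-be56-0242ac120002", "gauge": 10}

POST /tsdb-repro/_doc
{"@timestamp": "2021-04-28T18:00:00.000Z", "region": "us-east-2", "host": "foo", "pod": "fb58e236-48af-11ee-be56-0242ac120002", "counter": 10}
```

The second index request should be rejected by TSDB duplicate detection, since both documents map to the same (_tsid, timestamp) pair.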
Logs (if relevant)
No response