
Missing metric name in tsid results in TSDB duplicate detection dropping data #99123

@salvatore-campagna

Description


Elasticsearch Version

8.7 and above

Installed Plugins

No response

Java Version

bundled

OS Version

All

Problem Description

This issue has been observed while integrating Prometheus remote write with the time series database (TSDB). It happens as a result of batching taking place before metrics are written to TSDB. Given two batches (possibly with other batches in between), writing the second batch might fail due to TSDB duplicate detection if the first batch had the same set of dimensions (names and values) and the same timestamp. This happens because of the way we generate "the primary key" for documents stored in TSDB.

Each document is identified by the pair (_tsid, timestamp), so clients can easily generate batches that include metrics with the same _tsid and timestamp. Note also that clients might pre-aggregate metrics on the timestamp field, which increases the chance of duplicates because documents are generated at specific points in time.
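A minimal sketch of this identity scheme (the helper names below are hypothetical and the real _tsid encoding in Elasticsearch differs; only the idea that the key is built from dimension fields plus the timestamp comes from the description above):

```python
# Hypothetical sketch: the document "primary key" is (_tsid, @timestamp).
# Metric fields such as "gauge" or "counter" play no part in it.
def make_tsid(doc, dimensions):
    # _tsid is derived only from dimension names and values (sorted by name).
    return ":".join(f"{k}={doc[k]}" for k in sorted(dimensions))

def identity_key(doc, dimensions):
    return (make_tsid(doc, dimensions), doc["@timestamp"])

dims = {"region", "host", "pod"}
doc = {"@timestamp": "2021-04-28T18:00:00.000Z", "region": "us-east-2",
       "host": "foo", "pod": "fb58e236-48af-11ee-be56-0242ac120002",
       "gauge": 10}
print(identity_key(doc, dims))
```

Because the metric field never enters the key, any two documents that agree on dimensions and timestamp collide, regardless of which metrics they carry.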

Example

Batch 1

...
{"@timestamp": "2021-04-28T18:00:00.000Z", "region": "us-east-2", "host": "foo", "pod": "fb58e236-48af-11ee-be56-0242ac120002", "gauge": 10}
...

Batch 2

...
{"@timestamp": "2021-04-28T18:00:00.000Z", "region": "us-east-2", "host": "foo", "pod": "fb58e236-48af-11ee-be56-0242ac120002", "counter": 10}
...

Here we have two metrics, gauge and counter, whose dimensions (names and values) are identical.

The _tsid for the first batch is the same as the _tsid of the second batch, something similar to:

region=us-east-2:host=foo:pod=fb58e236-48af-11ee-be56-0242ac120002

Also the timestamp is the same.

The result is that duplicate detection kicks in, rejecting the second batch as a duplicate.
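The failure mode can be simulated with a small sketch (the `index_batch` helper and set-based dedup below are hypothetical stand-ins for TSDB duplicate detection, not the actual Elasticsearch implementation):

```python
def make_tsid(doc, dimensions):
    # Hypothetical _tsid: dimension name/value pairs, sorted by name.
    return ":".join(f"{k}={doc[k]}" for k in sorted(dimensions))

def index_batch(batch, dimensions, seen):
    # Accept a document only if its (_tsid, timestamp) pair is unseen.
    accepted, rejected = [], []
    for doc in batch:
        key = (make_tsid(doc, dimensions), doc["@timestamp"])
        if key in seen:
            rejected.append(doc)   # dropped as a duplicate
        else:
            seen.add(key)
            accepted.append(doc)
    return accepted, rejected

dims = {"region", "host", "pod"}
shared = {"@timestamp": "2021-04-28T18:00:00.000Z", "region": "us-east-2",
          "host": "foo", "pod": "fb58e236-48af-11ee-be56-0242ac120002"}
batch1 = [dict(shared, gauge=10)]    # first batch: the gauge metric
batch2 = [dict(shared, counter=10)]  # second batch: the counter metric

seen = set()
index_batch(batch1, dims, seen)
accepted, rejected = index_batch(batch2, dims, seen)
print(len(rejected))  # the counter document is rejected as a duplicate
```

The counter document carries different data, yet it is indistinguishable from the gauge document once reduced to (_tsid, timestamp).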

Integrations work around this issue by adding the metric name as an additional "dummy" dimension field. The gauge and counter metrics then differ in that dimension, which yields different _tsids, so two documents with different _tsid (and the same timestamp) are written. Adding this extra dimension (the metric name), however, results in two documents being stored instead of one, which adds TSDB storage overhead.
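Continuing the earlier sketch, the workaround looks like this (the dimension name `metric_name` is an arbitrary illustration; integrations may choose a different field name):

```python
def make_tsid(doc, dimensions):
    # Hypothetical _tsid: dimension name/value pairs, sorted by name.
    return ":".join(f"{k}={doc[k]}" for k in sorted(dimensions))

base = {"@timestamp": "2021-04-28T18:00:00.000Z", "region": "us-east-2",
        "host": "foo", "pod": "fb58e236-48af-11ee-be56-0242ac120002"}

# Workaround: each document carries its metric name as an extra dimension,
# so the two _tsids no longer collide.
doc_gauge = dict(base, metric_name="gauge", gauge=10)
doc_counter = dict(base, metric_name="counter", counter=10)
dims = {"region", "host", "pod", "metric_name"}

print(make_tsid(doc_gauge, dims) != make_tsid(doc_counter, dims))  # True
```

This avoids the data loss, at the cost described above: two stored documents where one would otherwise suffice.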

This is considered a bug, since we reject data as a result of incorrect duplicate detection. The two batches include data about two different metrics, even though they share the same set of dimensions (_tsid). The contract of metric protocols such as Prometheus remote write or OTLP metrics is that the metric name is part of the identity of a time series.

Steps to Reproduce

Just try to index two documents with the same dimensions and timestamp, each with a different metric field, for instance one with a counter and one with a gauge.
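A sketch of a reproduction payload, assuming a TSDB index built from the example's fields (the index settings and mapping parameters `index.mode: time_series`, `index.routing_path`, `time_series_dimension`, and `time_series_metric` are real Elasticsearch TSDB options; the script only builds the request bodies and does not contact a cluster):

```python
import json

# Settings and mappings for a time-series index over the example's fields.
index_body = {
    "settings": {
        "index.mode": "time_series",
        "index.routing_path": ["region", "host", "pod"],
    },
    "mappings": {
        "properties": {
            "@timestamp": {"type": "date"},
            "region": {"type": "keyword", "time_series_dimension": True},
            "host": {"type": "keyword", "time_series_dimension": True},
            "pod": {"type": "keyword", "time_series_dimension": True},
            "gauge": {"type": "double", "time_series_metric": "gauge"},
            "counter": {"type": "double", "time_series_metric": "counter"},
        }
    },
}

shared = {"@timestamp": "2021-04-28T18:00:00.000Z", "region": "us-east-2",
          "host": "foo", "pod": "fb58e236-48af-11ee-be56-0242ac120002"}

# _bulk payload: two documents with identical dimensions and timestamp,
# each carrying a different metric. Indexing the second is expected to
# fail with a duplicate (_tsid, timestamp) error.
bulk_lines = []
for metric_doc in (dict(shared, gauge=10), dict(shared, counter=10)):
    bulk_lines.append(json.dumps({"create": {}}))
    bulk_lines.append(json.dumps(metric_doc))
payload = "\n".join(bulk_lines) + "\n"
print(payload)
```

Sending `payload` to the `_bulk` endpoint of such an index should reproduce the rejection described above.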

Logs (if relevant)

No response
