Introduce single prefix byte TSID layout #143955

Merged
dnhatn merged 21 commits into elastic:main from dnhatn:tsid-for-otel
Mar 19, 2026

Conversation

@dnhatn
Member

@dnhatn dnhatn commented Mar 10, 2026

The current TSID layout:

byte 0 = hash(dimension_names)
bytes 1–4 = hash(value_0) … hash(value_3)
bytes 5–20 = 128-bit hash (uniqueness)

The first 5 bytes are derived from dimension names and a subset of values. Since these are often static for a given metric, partitioning by prefix typically yields only a single partition.

This change introduces a 16-byte TSID layout with a single prefix byte to better partition time-series of a single metric:

byte 0 = hash(metric_name) for OTel,
         hash(labels.__name__) for Prometheus,
         hash(all dimension names and values) for generic TSDB
bytes 1–15 = 15-byte hash (uniqueness + within-metric ordering)
---
Total: 16 bytes (2 longs)

The new layout groups time-series by metric, enabling partitioned rate aggregations across at least 256 partitions.
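
The layout above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Elasticsearch implementation: the class `TsidLayoutSketch`, its method names, the way dimensions are serialized, and the use of SHA-256 as the hash are all hypothetical stand-ins. The dimension keys `_metric_names_hash` (OTel) and `labels.__name__` (Prometheus) are the ones named in this thread.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative sketch only; names and the hash function are assumptions.
final class TsidLayoutSketch {
    static final int TSID_LENGTH = 16;

    // Stand-in hash (SHA-256); the real implementation may use a different function.
    static byte[] hash(String input) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(input.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Choose what drives byte 0: the metric identity when one is available,
    // otherwise all dimension names and values.
    static String prefixSource(SortedMap<String, String> dimensions) {
        String otel = dimensions.get("_metric_names_hash");
        if (otel != null) {
            return otel;
        }
        String prometheus = dimensions.get("labels.__name__");
        if (prometheus != null) {
            return prometheus;
        }
        return dimensions.toString();
    }

    // byte 0 = one byte of the prefix hash; bytes 1..15 = 15 bytes of a hash over all dimensions.
    static byte[] buildTsid(Map<String, String> dimensions) {
        SortedMap<String, String> sorted = new TreeMap<>(dimensions);
        byte[] tsid = new byte[TSID_LENGTH];
        tsid[0] = hash(prefixSource(sorted))[0];
        System.arraycopy(hash(sorted.toString()), 0, tsid, 1, TSID_LENGTH - 1);
        return tsid;
    }
}
```

With this sketch, two series that share the same `_metric_names_hash` value but differ in another dimension get the same byte 0 and different trailing bytes, which is the grouping property the description relies on.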


I've been working on partitioning time series using TSID prefix bytes to enable parallel rate aggregation. The current layout is:

byte0      = hash(dimension_names)
byte1      = hash(value_0)
byte2      = hash(value_1)
byte3      = hash(value_2)
byte4      = hash(value_3)
bytes 5–20 = 128-bit hash of all dimension names and values (uniqueness)
---
Total: 21 bytes (1 name byte + up to 4 value bytes + 16 hash bytes).

Partitioning on the first two prefix bytes yields only one effective partition for OTel data because all time series share the same dimension names and _metric_names_hash value. I explored several alternative layouts - with and without SimHash, various bit allocations (3+3+10, 8+8, pure SimHash, clustering bytes) - all leading to either storage or query regressions.

After profiling the doc access patterns, the root cause turned out to be metric interleaving. When different metrics interleave in sort order, queries must skip over irrelevant documents, increasing traversal cost. The key insight is that reserving a full byte for _metric_names_hash ensures all time series of the same metric are grouped contiguously, with no interleaving. This improves compression and query performance since each query accesses exactly one contiguous slice of the segment.
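
The interleaving effect can be illustrated with a small sketch. It assumes tsids sort as unsigned bytes and uses hand-picked byte values instead of real hashes; `InterleavingSketch`, `Series`, and `contiguousRuns` are hypothetical names, not code from this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; not the Elasticsearch implementation.
final class InterleavingSketch {
    // A time series: its metric name and its tsid bytes.
    static final class Series {
        final String metric;
        final byte[] tsid;

        Series(String metric, byte[] tsid) {
            this.metric = metric;
            this.tsid = tsid;
        }
    }

    // Sort by tsid (unsigned byte order, an assumption here) and count how many
    // contiguous runs of the same metric appear in the sorted sequence.
    static int contiguousRuns(List<Series> series) {
        List<Series> sorted = new ArrayList<>(series);
        sorted.sort((x, y) -> Arrays.compareUnsigned(x.tsid, y.tsid));
        int runs = 0;
        String previous = null;
        for (Series s : sorted) {
            if (!s.metric.equals(previous)) {
                runs++;
                previous = s.metric;
            }
        }
        return runs;
    }
}
```

If two metrics each have two series and byte 0 differs per metric, the sorted order contains two runs, so a query for one metric reads one contiguous slice. If byte 0 is identical for all four series and the later bytes alternate between metrics, the same count rises to four runs, and a query has to skip over the other metric's documents.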

This change introduces a specialized 16-byte tsid layout for OTel schemas (if the first dimension is _metric_names_hash):

byte0      = hash(_metric_names_hash value) — separates metric types
bytes 1–15 = 120-bit hash of all dimensions — uniqueness and within-metric ordering
---
Total: 16 bytes (2 longs).

Non-OTel schemas continue to use the current layout. Although we can also make it 16 bytes, that is for follow-up work.

This builds on Felix's work in #133706, bringing the OTel tsid down to a fixed 16 bytes, enabling future optimizations for hashing.

Labelling this as a non-issue since the changes are gated by a feature flag.

@dnhatn dnhatn changed the title Specialized TSID layout for OTel schemas Specialized tsid layout for OTel schemas Mar 10, 2026
@dnhatn dnhatn changed the title Specialized tsid layout for OTel schemas Specialized tsid layout for otel schemas Mar 10, 2026
@felixbarny
Member

Partitioning on the first two prefix bytes yields only one effective partition for OTel data because all time series share the same dimension names and _metric_names_hash value.

I wonder if this changes once a data stream has a wider variety of data. But maybe the _metric_names_hash is good enough for those cases as well.

We should probably also do a similar thing for Prometheus data using labels.__name__.

@elastic elastic deleted a comment from elasticmachine Mar 11, 2026
@elastic elastic deleted a comment from elasticmachine Mar 11, 2026
@dnhatn
Member Author

dnhatn commented Mar 12, 2026

tsdb-metricsgen-270m benchmarks

@dnhatn
Member Author

dnhatn commented Mar 12, 2026

tsdb benchmarks

@elastic elastic deleted a comment from elasticmachine Mar 12, 2026
@dnhatn
Member Author

dnhatn commented Mar 15, 2026

tsdb-metricsgen-270m benchmarks

@elastic elastic deleted a comment from elasticmachine Mar 15, 2026
@dnhatn
Member Author

dnhatn commented Mar 15, 2026

@martijnvg @kkrik-es @felixbarny I am not sure we can change the TSID layout of a data stream where some old backing indices have the old TSID layout. Querying across backing indices with old and new layouts could return incorrect results - time-series boundaries in old indices won't match those in the new index. Only the first/last aggregation buckets at the index boundary will be affected. We could add an index-version marker to the data stream, but existing data streams would never get the benefit. I am not sure if we should accept the broken boundary aggregation buckets.

@kkrik-es
Member

This should only affect unwrapped time series aggs without dimension bucketing - not common, but possible.

I'm on the fence here. What kind of wins are we seeing? I wonder if we can get close with the backup plan of slicing by time.

Member

@kkrik-es kkrik-es left a comment

Promising.

@dnhatn dnhatn marked this pull request as ready for review March 19, 2026 04:06
@dnhatn dnhatn requested a review from a team as a code owner March 19, 2026 04:06
@elasticsearchmachine
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

Member

@martijnvg martijnvg left a comment

One question, LGTM 👍

}

public static boolean useSingleBytePrefixLayout(IndexVersion indexVersion) {
    return SINGLE_PREFIX_BYTE_ENABLED && indexVersion.onOrAfter(IndexVersions.TSID_SINGLE_PREFIX_BYTE_FEATURE_FLAG);
Member

From PR description:

Non-OTel schemas continue to use the current layout. Although we can also make it 16 bytes, that is for follow-up work.

Do we still want to enforce this? Or is the PR description stale? I do see that computeSingleBytePrefix(...) is adaptive if it can't find specific dimensions.

Member

Should be stale, from the first iteration. Nhat will confirm, should work across the board now.

Member Author

Yes, I've updated the PR title and description.

@dnhatn dnhatn changed the title Specialized tsid layout for otel schemas Introduce single prefix byte TSID layout Mar 19, 2026
@dnhatn
Member Author

dnhatn commented Mar 19, 2026

@felixbarny @kkrik-es @martijnvg Thank you so much for the feedback and review.

@dnhatn dnhatn enabled auto-merge (squash) March 19, 2026 15:50
@felixbarny
Member

Let us know once you have some concrete numbers on the improvements this unlocks. I'm really curious 🙂

@dnhatn
Member Author

dnhatn commented Mar 19, 2026

@felixbarny I hope there are no bugs in the prototype. The results so far are promising: Benchmark results

@dnhatn dnhatn merged commit 9846b8e into elastic:main Mar 19, 2026
36 checks passed
@dnhatn dnhatn deleted the tsid-for-otel branch March 19, 2026 18:51
@felixbarny
Member

Wow, this seems like a massive difference 😮
This doesn't just come from this change alone, does it?

@dnhatn
Member Author

dnhatn commented Mar 19, 2026

@felixbarny We will need two more PRs.

@JVerwolf
Contributor

This seems to be related: #144678

michalborek pushed a commit to michalborek/elasticsearch that referenced this pull request Mar 23, 2026
dnhatn added a commit that referenced this pull request Mar 23, 2026
Follow-up to #143955, which introduced a single-byte metric prefix in 
the tsid layout.

This PR writes prefix partition metadata for the _tsid field. The _tsid 
field is grouped by its first 2 bytes - the metric prefix byte (byte-0)
plus one random byte (byte-1) - yielding up to 256 partitions per
metric. The partition records the starting document for each prefix
group, allowing the query engine to slice data so that each slice
contains only time-series sharing the same prefix.
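
A rough sketch of the idea, under stated assumptions: tsids are given in document order and are already sorted, the partition key is formed from the first two bytes, and the metadata is modeled as an in-memory map. `PrefixPartitionSketch`, `prefixKey`, and `startingDocs` are hypothetical names; the actual on-disk format is not shown in this thread.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; not the Elasticsearch implementation.
final class PrefixPartitionSketch {
    // Combine the first two tsid bytes into a key in [0, 65535].
    static int prefixKey(byte[] tsid) {
        return ((tsid[0] & 0xFF) << 8) | (tsid[1] & 0xFF);
    }

    // Input must already be sorted by tsid so that equal prefixes are adjacent.
    // Returns prefix key -> index of the first document carrying that prefix.
    static Map<Integer, Integer> startingDocs(List<byte[]> tsidsInDocOrder) {
        Map<Integer, Integer> starts = new LinkedHashMap<>();
        int previous = -1;
        for (int doc = 0; doc < tsidsInDocOrder.size(); doc++) {
            int key = prefixKey(tsidsInDocOrder.get(doc));
            if (key != previous) {
                starts.put(key, doc);
                previous = key;
            }
        }
        return starts;
    }
}
```

Each slice then spans from one starting document to the next, so all documents of a given time series fall in a single slice, because a series has one tsid and therefore one prefix.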

This enables ESQL to partition work across slices without splitting any 
individual time-series - a requirement for aggregations like rate. This
should reduce memory usage and improve performance compared to
time-interval partitioning, which requires multiple queries over
fragmented data.

The compute engine is not wired up yet, so no improvements are expected
at this stage. This change may cause a small regression in indexing
throughput and storage overhead, which is expected to be trivial.

Relates #143955
dnhatn added a commit that referenced this pull request Apr 4, 2026
This change enables the new single-prefix-byte tsid layout in release builds.

Relates #143955
mromaios pushed a commit to mromaios/elasticsearch that referenced this pull request Apr 9, 2026