Introduce single prefix byte TSID layout #143955
Conversation
I wonder if this changes once a data stream has more variety of data. We should probably also do a similar thing for Prometheus data.
tsdb-metricsgen-270m benchmarks
tsdb benchmarks
@martijnvg @kkrik-es @felixbarny I am not sure we can change the TSID layout of a data stream where some old backing indices have the old TSID layout. Querying across backing indices with old and new layouts could return incorrect results - time-series boundaries in old indices won't match those in the new index. Only the first/last aggregation buckets at the index boundary will be affected. We could add an index-version marker to the data stream, but existing data streams would never get the benefit. I am not sure if we should accept the broken boundary aggregation buckets.

This should only affect unwrapped time series aggs without dimension bucketing - not common but possible. I'm on the fence here. What kind of wins are we seeing? I wonder if we can get close with the backup plan of slicing by time.

Pinging @elastic/es-storage-engine (Team:StorageEngine)
```java
public static boolean useSingleBytePrefixLayout(IndexVersion indexVersion) {
    return SINGLE_PREFIX_BYTE_ENABLED && indexVersion.onOrAfter(IndexVersions.TSID_SINGLE_PREFIX_BYTE_FEATURE_FLAG);
}
```
From PR description:
Non-OTel schemas continue to use the current layout. Although we can also make it 16 bytes, that is for follow-up work.
Do we still want to enforce this? Or is the PR description stale? I do see that computeSingleBytePrefix(...) is adaptive if it can't find specific dimensions.
Should be stale, from the first iteration. Nhat will confirm, should work across the board now.
Yes, I've updated the PR title and description.
@felixbarny @kkrik-es @martijnvg Thank you so much for the feedback and review.

Let us know once you have some concrete numbers on the improvements this unlocks. I'm really curious 🙂

@felixbarny I hope there are no bugs in the prototype. The results so far are promising: Benchmark results

Wow, this seems like a massive difference 😮

@felixbarny We will need two more PRs.

This seems to be related: #144678
The current TSID layout:
```
byte 0 = hash(dimension_names)
bytes 1–4 = hash(value_0) … hash(value_3)
bytes 5–20 = 128-bit hash (uniqueness)
```
The first 5 bytes are derived from dimension names and a subset of
values. Since these are often static for a given metric, partitioning by
prefix typically yields only a single partition.
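The collapse to a single partition can be sketched with a toy model. This is hypothetical illustration code, not the actual Elasticsearch `TimeSeriesIdFieldMapper` logic; `legacyPrefixByte` and the use of `String.hashCode` are simplified stand-ins. The structural point holds regardless of the hash function: byte 0 depends only on the dimension names, which are fixed for a given metric schema, so every series lands in the same prefix partition.

```java
import java.util.Map;
import java.util.TreeMap;

public class LegacyPrefixSketch {

    /** Stand-in for byte 0 of the legacy TSID: a hash of the dimension names only. */
    static byte legacyPrefixByte(Map<String, String> dimensions) {
        // TreeMap keeps names in a stable sorted order, mimicking canonicalization.
        String names = String.join(",", new TreeMap<>(dimensions).keySet());
        return (byte) names.hashCode();
    }

    public static void main(String[] args) {
        // Two different series of the same metric: same dimension names, different values.
        Map<String, String> seriesA = Map.of("metric", "cpu.usage", "host", "a");
        Map<String, String> seriesB = Map.of("metric", "cpu.usage", "host", "b");
        // Same prefix byte for both series, so prefix partitioning yields one partition.
        System.out.println(legacyPrefixByte(seriesA) == legacyPrefixByte(seriesB)); // true
    }
}
```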
This change introduces a 16-byte TSID layout with a single prefix byte
to better partition time-series of a single metric:
```
byte 0 = hash(metric_name) for OTel,
hash(labels.__name__) for Prometheus,
hash(all dimension names and values) for generic TSDB
bytes 1–15 = 15-byte hash (uniqueness + within-metric ordering)
---
Total: 16 bytes (2 longs)
```
The new layout groups time-series by metric, enabling partitioned rate
aggregations across at least 256 partitions.
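As a rough sketch of how the new layout achieves this, the following illustrative code uses SHA-256 as a stand-in for the mapper's hash functions (the real layout is produced internally by the TSID mapper): byte 0 is derived from the metric name alone, and bytes 1-15 from the full series identity, so all series of one metric share byte 0 and stay contiguous in `_tsid` sort order while still spreading across up to 256 prefix partitions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class SingleBytePrefixSketch {

    static byte[] tsid(String metricName, String seriesIdentity) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] out = new byte[16];
            // Byte 0: hash of the metric name (or labels.__name__ for Prometheus).
            out[0] = sha.digest(metricName.getBytes(StandardCharsets.UTF_8))[0];
            // Bytes 1-15: hash of the whole series identity (uniqueness + ordering).
            byte[] full = sha.digest(seriesIdentity.getBytes(StandardCharsets.UTF_8));
            System.arraycopy(full, 0, out, 1, 15);
            return out;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** A slice per first byte never splits a time-series of a single metric. */
    static int partition(byte[] tsid) {
        return tsid[0] & 0xFF; // 0..255
    }

    public static void main(String[] args) {
        byte[] a = tsid("cpu.usage", "cpu.usage|host=a");
        byte[] b = tsid("cpu.usage", "cpu.usage|host=b");
        System.out.println(a.length);                     // 16
        System.out.println(partition(a) == partition(b)); // true: same metric, same partition
        System.out.println(Arrays.equals(a, b));          // false: distinct series
    }
}
```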
Follow-up to #143955, which introduced a single-byte metric prefix in the TSID layout. This PR writes prefix partition metadata for the _tsid field.

The _tsid field is grouped by its first 2 bytes - the metric prefix byte (byte 0) plus one random byte (byte 1) - yielding up to 256 partitions per metric. The partition metadata records the starting document for each prefix group, allowing the query engine to slice data so that each slice contains only time-series sharing the same prefix. This enables ESQL to partition work across slices without splitting any individual time-series - a requirement for aggregations like rate. It should reduce memory usage and improve performance compared to time-interval partitioning, which requires multiple queries over fragmented data.

The compute engine is not wired up yet, so no query-side improvements are expected from this change alone. It may cause a small regression in indexing throughput and storage overhead, which is expected to be trivial.

Relates #143955
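The partition metadata can be pictured as a map from 2-byte prefix to the first document carrying that prefix. This is a hypothetical sketch (`buildMetadata` is invented for illustration; the real structure lives inside the Lucene codec): because documents are sorted by _tsid, each prefix group occupies a contiguous doc range, and recording its start is enough to hand out slices that never split a time-series.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PrefixPartitionSketch {

    /** Prefix key (byte0 << 8 | byte1) -> first docID with that prefix. */
    static TreeMap<Integer, Integer> buildMetadata(List<byte[]> sortedTsids) {
        TreeMap<Integer, Integer> starts = new TreeMap<>();
        for (int doc = 0; doc < sortedTsids.size(); doc++) {
            byte[] tsid = sortedTsids.get(doc);
            int key = ((tsid[0] & 0xFF) << 8) | (tsid[1] & 0xFF);
            starts.putIfAbsent(key, doc); // first doc of each contiguous prefix group
        }
        return starts;
    }

    public static void main(String[] args) {
        // Three prefix groups over six docs (already in _tsid sort order).
        List<byte[]> tsids = List.of(
            new byte[] {1, 1, 9}, new byte[] {1, 1, 10},                        // group (1,1): docs 0-1
            new byte[] {1, 2, 3},                                               // group (1,2): doc 2
            new byte[] {2, 0, 7}, new byte[] {2, 0, 8}, new byte[] {2, 0, 9});  // group (2,0): docs 3-5
        Map<Integer, Integer> meta = buildMetadata(tsids);
        System.out.println(meta.get((1 << 8) | 1)); // 0
        System.out.println(meta.get((1 << 8) | 2)); // 2
        System.out.println(meta.get(2 << 8));       // 3
    }
}
```

A slice for prefix group (1, 1) would then cover docs 0 through 1, i.e. from its recorded start up to the next group's start.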
This change enables the new layout - single prefix byte for tsid in release builds. Relates #143955
I've been working on partitioning time series using TSID prefix bytes to enable parallel rate aggregation. The current layout is:

Partitioning on the first two prefix bytes yields only one effective partition for OTel data because all time series share the same dimension names and `_metric_names_hash` value. I explored several alternative layouts - with and without SimHash, various bit allocations (3+3+10, 8+8, pure SimHash, clustering bytes) - all leading to either storage or query regressions.

After profiling the doc access patterns, the root cause was metric interleaving. When different metrics interleave in sort order, queries must skip over irrelevant documents, increasing traversal cost. The key insight is that reserving a full byte for `_metric_names_hash` ensures all time series of the same metric are grouped contiguously - no interleaving. This improves compression and query performance since each query accesses exactly one contiguous slice of the segment.

This change introduces a specialized 16-byte TSID layout for OTel schemas (if the first dimension is `_metric_names_hash`). Non-OTel schemas continue to use the current layout. Although we can also make it 16 bytes, that is for follow-up work.

This builds on Felix's work in #133706, bringing the OTel TSID down to a fixed 16 bytes, enabling future optimizations for hashing.
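The "fixed 16 bytes" property is what makes the two-longs view from the earlier description possible. The following is a hypothetical helper, not Elasticsearch code: a fixed-width TSID can be reinterpreted as two big-endian longs, so equality and ordering checks reduce to two word comparisons instead of a variable-length byte loop, and unsigned long comparison of big-endian words matches the byte-wise sort order.

```java
import java.nio.ByteBuffer;

public class TsidAsLongs {

    /** Reinterpret a 16-byte TSID as two big-endian longs. */
    static long[] asLongs(byte[] tsid16) {
        if (tsid16.length != 16) throw new IllegalArgumentException("expected 16 bytes");
        ByteBuffer buf = ByteBuffer.wrap(tsid16); // big-endian by default
        return new long[] {buf.getLong(), buf.getLong()};
    }

    /** Unsigned comparison of the two words matches unsigned byte-wise TSID order. */
    static int compare(long[] a, long[] b) {
        int cmp = Long.compareUnsigned(a[0], b[0]);
        return cmp != 0 ? cmp : Long.compareUnsigned(a[1], b[1]);
    }

    public static void main(String[] args) {
        byte[] x = new byte[16];
        byte[] y = new byte[16];
        y[15] = 1; // y sorts just after x in byte-wise order
        System.out.println(compare(asLongs(x), asLongs(y)) < 0); // true
    }
}
```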
Labelling this as a non-issue since the changes are gated by a feature flag.