Introduce single prefix byte TSID layout #143955
Conversation
I wonder if this changes once a data stream has more variety of data. We should probably also do a similar thing for Prometheus data.
tsdb-metricsgen-270m benchmarks
tsdb benchmarks
@martijnvg @kkrik-es @felixbarny I am not sure we can change the TSID layout of a data stream where some old backing indices have the old TSID layout. Querying across backing indices with old and new layouts could return incorrect results - time-series boundaries in old indices won't match those in the new index. Only the first/last aggregation buckets at the index boundary will be affected. We could add an index-version marker to the data stream, but existing data streams would never get the benefit. I am not sure if we should accept the broken boundary aggregation buckets.

This should only affect unwrapped time series aggs without dimension bucketing - not common but possible. I'm on the fence here. What kind of wins are we seeing? I wonder if we can get close with the backup plan of slicing by time.

Pinging @elastic/es-storage-engine (Team:StorageEngine)
```java
public static boolean useSingleBytePrefixLayout(IndexVersion indexVersion) {
    return SINGLE_PREFIX_BYTE_ENABLED && indexVersion.onOrAfter(IndexVersions.TSID_SINGLE_PREFIX_BYTE_FEATURE_FLAG);
}
```
From PR description:
Non-OTel schemas continue to use the current layout. Although we can also make it 16 bytes, that is for follow-up work.
Do we still want to enforce this? Or is the PR description stale? I do see that computeSingleBytePrefix(...) is adaptive if it can't find specific dimensions.
Should be stale, from the first iteration. Nhat will confirm, should work across the board now.
Yes, I've updated the PR title and description.
@felixbarny @kkrik-es @martijnvg Thank you so much for the feedback and review.

Let us know once you have some concrete numbers on the improvements this unlocks. I'm really curious 🙂

@felixbarny I hope there are no bugs in the prototype. The results so far are promising: Benchmark results

Wow, this seems like a massive difference 😮

@felixbarny We will need two more PRs.

This seems to be related: #144678
The current TSID layout:
```
byte 0 = hash(dimension_names)
bytes 1–4 = hash(value_0) … hash(value_3)
bytes 5–20 = 128-bit hash (uniqueness)
```
The first 5 bytes are derived from dimension names and a subset of
values. Since these are often static for a given metric, partitioning by
prefix typically yields only a single partition.
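The collapse to a single partition can be sketched with a toy model. This is hypothetical illustration code, not the actual Elasticsearch `TimeSeriesIdFieldMapper` logic; `legacyPrefixByte` and the use of `String.hashCode` are simplified stand-ins. The structural point holds regardless of the hash function: byte 0 depends only on the dimension names, which are fixed for a given metric schema, so every series lands in the same prefix partition.

```java
import java.util.Map;
import java.util.TreeMap;

public class LegacyPrefixSketch {

    /** Stand-in for byte 0 of the legacy TSID: a hash of the dimension names only. */
    static byte legacyPrefixByte(Map<String, String> dimensions) {
        // TreeMap keeps names in a stable sorted order, mimicking canonicalization.
        String names = String.join(",", new TreeMap<>(dimensions).keySet());
        return (byte) names.hashCode();
    }

    public static void main(String[] args) {
        // Two different series of the same metric: same dimension names, different values.
        Map<String, String> seriesA = Map.of("metric", "cpu.usage", "host", "a");
        Map<String, String> seriesB = Map.of("metric", "cpu.usage", "host", "b");
        // Same prefix byte for both series, so prefix partitioning yields one partition.
        System.out.println(legacyPrefixByte(seriesA) == legacyPrefixByte(seriesB)); // true
    }
}
```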
This change introduces a 16-byte TSID layout with a single prefix byte
to better partition time-series of a single metric:
```
byte 0 = hash(metric_name) for OTel,
hash(labels.__name__) for Prometheus,
hash(all dimension names and values) for generic TSDB
bytes 1–15 = 15-byte hash (uniqueness + within-metric ordering)
---
Total: 16 bytes (2 longs)
```
The new layout groups time-series by metric, enabling partitioned rate
aggregations across at least 256 partitions.
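As a rough sketch of how the new layout achieves this, the following illustrative code uses SHA-256 as a stand-in for the mapper's hash functions (the real layout is produced internally by the TSID mapper): byte 0 is derived from the metric name alone, and bytes 1-15 from the full series identity, so all series of one metric share byte 0 and stay contiguous in `_tsid` sort order while still spreading across up to 256 prefix partitions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class SingleBytePrefixSketch {

    static byte[] tsid(String metricName, String seriesIdentity) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] out = new byte[16];
            // Byte 0: hash of the metric name (or labels.__name__ for Prometheus).
            out[0] = sha.digest(metricName.getBytes(StandardCharsets.UTF_8))[0];
            // Bytes 1-15: hash of the whole series identity (uniqueness + ordering).
            byte[] full = sha.digest(seriesIdentity.getBytes(StandardCharsets.UTF_8));
            System.arraycopy(full, 0, out, 1, 15);
            return out;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** A slice per first byte never splits a time-series of a single metric. */
    static int partition(byte[] tsid) {
        return tsid[0] & 0xFF; // 0..255
    }

    public static void main(String[] args) {
        byte[] a = tsid("cpu.usage", "cpu.usage|host=a");
        byte[] b = tsid("cpu.usage", "cpu.usage|host=b");
        System.out.println(a.length);                     // 16
        System.out.println(partition(a) == partition(b)); // true: same metric, same partition
        System.out.println(Arrays.equals(a, b));          // false: distinct series
    }
}
```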
Follow-up to #143955, which introduced a single-byte metric prefix in the TSID layout. This PR writes prefix partition metadata for the _tsid field.

The _tsid field is grouped by its first 2 bytes - the metric prefix byte (byte 0) plus one random byte (byte 1) - yielding up to 256 partitions per metric. The partition metadata records the starting document for each prefix group, allowing the query engine to slice data so that each slice contains only time-series sharing the same prefix. This enables ESQL to partition work across slices without splitting any individual time-series - a requirement for aggregations like rate. It should reduce memory usage and improve performance compared to time-interval partitioning, which requires multiple queries over fragmented data.

The compute engine is not wired up yet, so no query-side improvements are expected from this change alone. It may cause a small regression in indexing throughput and storage overhead, which is expected to be trivial.

Relates #143955
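The partition metadata can be pictured as a map from 2-byte prefix to the first document carrying that prefix. This is a hypothetical sketch (`buildMetadata` is invented for illustration; the real structure lives inside the Lucene codec): because documents are sorted by _tsid, each prefix group occupies a contiguous doc range, and recording its start is enough to hand out slices that never split a time-series.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PrefixPartitionSketch {

    /** Prefix key (byte0 << 8 | byte1) -> first docID with that prefix. */
    static TreeMap<Integer, Integer> buildMetadata(List<byte[]> sortedTsids) {
        TreeMap<Integer, Integer> starts = new TreeMap<>();
        for (int doc = 0; doc < sortedTsids.size(); doc++) {
            byte[] tsid = sortedTsids.get(doc);
            int key = ((tsid[0] & 0xFF) << 8) | (tsid[1] & 0xFF);
            starts.putIfAbsent(key, doc); // first doc of each contiguous prefix group
        }
        return starts;
    }

    public static void main(String[] args) {
        // Three prefix groups over six docs (already in _tsid sort order).
        List<byte[]> tsids = List.of(
            new byte[] {1, 1, 9}, new byte[] {1, 1, 10},                        // group (1,1): docs 0-1
            new byte[] {1, 2, 3},                                               // group (1,2): doc 2
            new byte[] {2, 0, 7}, new byte[] {2, 0, 8}, new byte[] {2, 0, 9});  // group (2,0): docs 3-5
        Map<Integer, Integer> meta = buildMetadata(tsids);
        System.out.println(meta.get((1 << 8) | 1)); // 0
        System.out.println(meta.get((1 << 8) | 2)); // 2
        System.out.println(meta.get(2 << 8));       // 3
    }
}
```

A slice for prefix group (1, 1) would then cover docs 0 through 1, i.e. from its recorded start up to the next group's start.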
This change enables the new layout - single prefix byte for tsid in release builds. Relates #143955
I've been working on partitioning time series using TSID prefix bytes to enable parallel rate aggregation. The current layout is:

Partitioning on the first two prefix bytes yields only one effective partition for OTel data because all time series share the same dimension names and `_metric_names_hash` value. I explored several alternative layouts - with and without SimHash, various bit allocations (3+3+10, 8+8, pure SimHash, clustering bytes) - all leading to either storage or query regressions.

After profiling the doc access patterns, the root cause was metric interleaving. When different metrics interleave in sort order, queries must skip over irrelevant documents, increasing traversal cost. The key insight is that reserving a full byte for `_metric_names_hash` ensures all time series of the same metric are grouped contiguously - no interleaving. This improves compression and query performance since each query accesses exactly one contiguous slice of the segment.

This change introduces a specialized 16-byte TSID layout for OTel schemas (if the first dimension is `_metric_names_hash`). Non-OTel schemas continue to use the current layout. Although we can also make it 16 bytes, that is for follow-up work.

This builds on Felix's work in #133706, bringing the OTel TSID down to a fixed 16 bytes, enabling future optimizations for hashing.
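The "fixed 16 bytes" property is what makes the two-longs view from the earlier description possible. The following is a hypothetical helper, not Elasticsearch code: a fixed-width TSID can be reinterpreted as two big-endian longs, so equality and ordering checks reduce to two word comparisons instead of a variable-length byte loop, and unsigned long comparison of big-endian words matches the byte-wise sort order.

```java
import java.nio.ByteBuffer;

public class TsidAsLongs {

    /** Reinterpret a 16-byte TSID as two big-endian longs. */
    static long[] asLongs(byte[] tsid16) {
        if (tsid16.length != 16) throw new IllegalArgumentException("expected 16 bytes");
        ByteBuffer buf = ByteBuffer.wrap(tsid16); // big-endian by default
        return new long[] {buf.getLong(), buf.getLong()};
    }

    /** Unsigned comparison of the two words matches unsigned byte-wise TSID order. */
    static int compare(long[] a, long[] b) {
        int cmp = Long.compareUnsigned(a[0], b[0]);
        return cmp != 0 ? cmp : Long.compareUnsigned(a[1], b[1]);
    }

    public static void main(String[] args) {
        byte[] x = new byte[16];
        byte[] y = new byte[16];
        y[15] = 1; // y sorts just after x in byte-wise order
        System.out.println(compare(asLongs(x), asLongs(y)) < 0); // true
    }
}
```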
Labelling this as a non-issue since the changes are gated by a feature flag.