Write prefix partition for tsid in tsdb codec #144617

Merged
dnhatn merged 9 commits into elastic:main from dnhatn:codec-write-tsid-prefixes on Mar 23, 2026

Conversation

@dnhatn
Member

@dnhatn dnhatn commented Mar 20, 2026

Follow-up to #143955, which introduced a single-byte metric prefix in the tsid layout.

This PR writes prefix partition metadata for the _tsid field. The _tsid field is grouped by its leading 18 bits - the metric prefix byte (byte-0) plus the next 10 random bits - yielding up to 1024 partitions per metric. The partition metadata records the starting document for each prefix group, allowing the query engine to slice data so that each slice contains only time-series sharing the same prefix.

This enables ESQL to partition work across slices without splitting any individual time-series - a requirement for aggregations like rate. This should reduce memory usage and improve performance compared to time-interval partitioning, which requires multiple queries over fragmented data.

The compute engine is not wired up yet, so no query-time improvements are expected from this change alone. It may introduce a small regression in indexing throughput and some storage overhead, both of which are expected to be trivial.

Relates #143955
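As a rough illustration of the layout described above (hypothetical code, not the actual codec; the class and method names are invented), a partition ordinal could be derived from a tsid's leading bytes. Per the discussion below, the final layout uses the 8-bit metric prefix byte followed by 10 partition bits, for 1024 partitions per metric:

```java
// Hypothetical sketch: derive a partition ordinal from the leading
// bytes of a _tsid. Assumes byte-0 is the metric prefix and the next
// 10 bits select one of 1024 partitions within that metric.
public final class TsidPrefix {
    static final int PARTITION_BITS = 10;           // 2^10 = 1024 partitions per metric
    static final int PARTITIONS = 1 << PARTITION_BITS;

    /** Returns the partition ordinal (0..1023) encoded in bits 8-17 of the tsid. */
    static int partitionOf(byte[] tsid) {
        int b1 = tsid[1] & 0xFF;                    // second byte: high 8 partition bits
        int b2 = tsid[2] & 0xFF;                    // third byte: only its top 2 bits are used
        return (b1 << 2) | (b2 >>> 6);              // 8 + 2 = 10 bits
    }

    public static void main(String[] args) {
        // metric prefix 0x07, all 10 partition bits set
        byte[] tsid = { 0x07, (byte) 0xFF, (byte) 0xC0, 0x12 };
        System.out.println(partitionOf(tsid));      // prints 1023
    }
}
```

Because documents are sorted by _tsid, all docs sharing a partition ordinal are contiguous, which is what lets the codec record just one starting doc per prefix group.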

@dnhatn dnhatn added the :StorageEngine/TSDB (You know, for Metrics), :StorageEngine/ES|QL (Timeseries / metrics / logsdb capabilities in ES|QL), :StorageEngine/Codec, and >non-issue labels Mar 20, 2026
@dnhatn dnhatn added the test-release (Trigger CI checks against release build) label Mar 20, 2026
@dnhatn dnhatn requested a review from kkrik-es March 20, 2026 16:51
Member

@kkrik-es kkrik-es left a comment

It'd be nice if @martijnvg can also take a look.

@dnhatn
Member Author

dnhatn commented Mar 20, 2026

It'd be nice if @martijnvg can also take a look.

++ we should wait for a review from Martijn!

@dnhatn dnhatn marked this pull request as ready for review March 21, 2026 17:14
@elasticsearchmachine
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@dnhatn
Member Author

dnhatn commented Mar 21, 2026

tsdb-metricsgen-270m - 256 partitions

@elastic elastic deleted a comment from elasticmachine Mar 21, 2026
@dnhatn dnhatn removed the test-release (Trigger CI checks against release build) label Mar 21, 2026
@dnhatn
Member Author

dnhatn commented Mar 22, 2026

@kkrik-es I increased the number of partitions from 256 to 1024 (from 16 bits to 18 bits), with a little more overhead but much greater benefit during query. Can you take another look?

@dnhatn dnhatn requested a review from kkrik-es March 22, 2026 23:42
@dnhatn
Member Author

dnhatn commented Mar 23, 2026

tsdb-metricsgen-270m - 1024 partitions

@elastic elastic deleted a comment from elasticmachine Mar 23, 2026
@kkrik-es
Member

@kkrik-es I increased the number of partitions from 256 to 1024 (from 16 bits to 18 bits), with a little more overhead but much greater benefit during query. Can you take another look?

Looks good, let's also update the description accordingly.

Member

@martijnvg martijnvg left a comment

One minor nit, LGTM 👍

static final int VERSION_BINARY_DV_COMPRESSION = 1;
static final int VERSION_NUMERIC_LARGE_BLOCKS = 2;
static final int VERSION_CURRENT = VERSION_NUMERIC_LARGE_BLOCKS;
static final int VERSION_PREFIX_PARTITIONS = 3;
Member

Version 3 is kind of taken by ES819Version3TSDBDocValuesFormat, so maybe use value 4 for the prefix partitioning? And maybe add a note that version 3 is part of that format.

In hindsight, ES819Version3TSDBDocValuesFormat wasn't needed (incorrect serverless upgrading made me think we had to introduce it), but it also doesn't have its own codec versioning scheme. This isn't ideal, but hopefully we can start with a new versioning scheme when ES94TSDBDocValuesFormat is introduced.

Member Author

yes, I pushed 6bbf4a8

@dnhatn
Member Author

dnhatn commented Mar 23, 2026

@kkrik-es @martijnvg Thank you for the reviews!

@dnhatn dnhatn merged commit 5658fce into elastic:main Mar 23, 2026
29 of 36 checks passed
@dnhatn dnhatn deleted the codec-write-tsid-prefixes branch March 23, 2026 20:17
salvatore-campagna added a commit to salvatore-campagna/elasticsearch that referenced this pull request Mar 24, 2026
Move prefix partition read/write logic into abstract base classes:
- Add PrefixPartitionedEntry, PartitionedDocValues on BaseSortedDocValues
- Update readSorted/readSortedSet for version-gated partition metadata
- Update doAddSortedField/addTermsDict for PrefixedPartitionsWriter
- Add writePrefixPartitions to TSDBDocValuesFormatConfig
- Make PrefixedPartitionsReader/Writer public for cross-package access
- Fix test type references (ES819BinaryDocValues -> TSDBBinaryDocValues)
dnhatn added a commit that referenced this pull request Mar 26, 2026
This change wires the prefix partitions introduced in #144617 to the 
compute engine.

Today, we partition the rate query by interval via replacing round_to 
with query_and_tags. With 10k time-series and a 5-minute bucket, each
interval query reads all 10k time-series from every segment. In the rate
aggregation, we buffer data points for all 10k time-series and maintain
a priority queue across all of them within each interval. This approach
increases concurrency to avoid underutilizing CPUs, but adds overhead
and is not I/O friendly due to fragmented reads.

With prefix partitions, we partition data by groups of contiguous 
time-series instead. For example, 10k time-series can be split into 1024
groups of ~10 each. Each group reads all matching data points, and
because these time-series are co-located in each segment, reads are
sequential and I/O friendly. In the rate aggregation, the priority queue
manages only ~10 time-series per group instead of 10k, significantly
reducing memory usage. To avoid excessive overhead from tiny partitions,
we merge adjacent partitions up to a target size (250k docs).

When prefix partitioning is not available (e.g., an older codec without the prefix layout), we fall back to the current behavior.
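The merge step described above can be sketched as follows. This is hypothetical code, not the actual compute-engine implementation: the class and method names are invented, and only the 250k-doc target comes from the text. Given each partition's starting document, adjacent partitions are coalesced until a slice reaches the target size:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of merging adjacent prefix partitions into
// slices of roughly a target document count (250k in the text).
// starts[i] is the first doc of partition i; maxDoc is the segment's
// total doc count. Returns [start, end) doc ranges, one per slice.
public final class PartitionMerger {
    static List<int[]> mergeToTarget(int[] starts, int maxDoc, int targetDocs) {
        List<int[]> slices = new ArrayList<>();
        int sliceStart = starts[0];
        for (int i = 0; i < starts.length; i++) {
            // End of partition i is the start of partition i+1 (or maxDoc for the last one).
            int end = (i + 1 < starts.length) ? starts[i + 1] : maxDoc;
            if (end - sliceStart >= targetDocs) {  // slice is big enough: cut here
                slices.add(new int[] { sliceStart, end });
                sliceStart = end;
            }
        }
        if (sliceStart < maxDoc) {                 // leftover tail becomes the final slice
            slices.add(new int[] { sliceStart, maxDoc });
        }
        return slices;
    }
}
```

Because partitions are contiguous runs of sorted _tsid values, merging neighbors keeps each slice a single sequential doc range, preserving the I/O-friendly reads while avoiding per-slice overhead from tiny partitions.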
seanzatzdev pushed a commit to seanzatzdev/elasticsearch that referenced this pull request Mar 26, 2026
seanzatzdev pushed a commit to seanzatzdev/elasticsearch that referenced this pull request Mar 27, 2026
mamazzol pushed a commit to mamazzol/elasticsearch that referenced this pull request Mar 30, 2026