Write prefix partition for tsid in tsdb codec #144617

Merged
dnhatn merged 9 commits into elastic:main from dnhatn:codec-write-tsid-prefixes on Mar 23, 2026

Conversation

@dnhatn
Member

@dnhatn dnhatn commented Mar 20, 2026

Follow-up to #143955, which introduced a single-byte metric prefix in the tsid layout.

This PR writes prefix partition metadata for the _tsid field. The _tsid field is grouped by its leading 18 bits - the metric prefix byte (byte-0) plus the next 10 random bits - yielding up to 1024 partitions per metric. The partition metadata records the starting document for each prefix group, allowing the query engine to slice data so that each slice contains only time-series sharing the same prefix.

This enables ESQL to partition work across slices without splitting any individual time-series - a requirement for aggregations like rate. This should reduce memory usage and improve performance compared to time-interval partitioning, which requires multiple queries over fragmented data.

The compute engine is not wired up yet, so no query-time improvements are expected from this change alone. It may introduce a small regression in indexing throughput and some storage overhead, both of which are expected to be trivial.

Relates #143955
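As a rough illustration of the layout described above (hypothetical code, not the actual codec; the class and method names are invented), a partition ordinal could be derived from a tsid's leading bytes. Per the discussion below, the final layout uses the 8-bit metric prefix byte followed by 10 partition bits, for 1024 partitions per metric:

```java
// Hypothetical sketch: derive a partition ordinal from the leading
// bytes of a _tsid. Assumes byte-0 is the metric prefix and the next
// 10 bits select one of 1024 partitions within that metric.
public final class TsidPrefix {
    static final int PARTITION_BITS = 10;           // 2^10 = 1024 partitions per metric
    static final int PARTITIONS = 1 << PARTITION_BITS;

    /** Returns the partition ordinal (0..1023) encoded in bits 8-17 of the tsid. */
    static int partitionOf(byte[] tsid) {
        int b1 = tsid[1] & 0xFF;                    // second byte: high 8 partition bits
        int b2 = tsid[2] & 0xFF;                    // third byte: only its top 2 bits are used
        return (b1 << 2) | (b2 >>> 6);              // 8 + 2 = 10 bits
    }

    public static void main(String[] args) {
        // metric prefix 0x07, all 10 partition bits set
        byte[] tsid = { 0x07, (byte) 0xFF, (byte) 0xC0, 0x12 };
        System.out.println(partitionOf(tsid));      // prints 1023
    }
}
```

Because documents are sorted by _tsid, all docs sharing a partition ordinal are contiguous, which is what lets the codec record just one starting doc per prefix group.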

@dnhatn dnhatn added the :StorageEngine/TSDB (You know, for Metrics), :StorageEngine/ES|QL (Timeseries / metrics / logsdb capabilities in ES|QL), :StorageEngine/Codec, and >non-issue labels Mar 20, 2026
@dnhatn dnhatn added the test-release (Trigger CI checks against release build) label Mar 20, 2026
@dnhatn dnhatn requested a review from kkrik-es March 20, 2026 16:51
Member

@kkrik-es kkrik-es left a comment

It'd be nice if @martijnvg can also take a look.

@dnhatn
Member Author

dnhatn commented Mar 20, 2026

It'd be nice if @martijnvg can also take a look.

++ we should wait for a review from Martijn!

@dnhatn dnhatn marked this pull request as ready for review March 21, 2026 17:14
@elasticsearchmachine
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@dnhatn
Member Author

dnhatn commented Mar 21, 2026

tsdb-metricsgen-270m - 256 partitions

@elastic elastic deleted a comment from elasticmachine Mar 21, 2026
@dnhatn dnhatn removed the test-release (Trigger CI checks against release build) label Mar 21, 2026
@dnhatn
Member Author

dnhatn commented Mar 22, 2026

@kkrik-es I increased the number of partitions from 256 to 1024 (from 16 bits to 18 bits), with a little more overhead but much greater benefit during query. Can you take another look?

@dnhatn dnhatn requested a review from kkrik-es March 22, 2026 23:42
@dnhatn
Member Author

dnhatn commented Mar 23, 2026

tsdb-metricsgen-270m - 1024 partitions

@elastic elastic deleted a comment from elasticmachine Mar 23, 2026
@kkrik-es
Member

@kkrik-es I increased the number of partitions from 256 to 1024 (from 16 bits to 18 bits), with a little more overhead but much greater benefit during query. Can you take another look?

Looks good, let's also update the description accordingly.

Member

@martijnvg martijnvg left a comment

One minor nit, LGTM 👍

static final int VERSION_BINARY_DV_COMPRESSION = 1;
static final int VERSION_NUMERIC_LARGE_BLOCKS = 2;
static final int VERSION_CURRENT = VERSION_NUMERIC_LARGE_BLOCKS;
static final int VERSION_PREFIX_PARTITIONS = 3;
Member

Version 3 is kind of taken by ES819Version3TSDBDocValuesFormat, so maybe use value 4 for the prefix partitioning? And maybe add a note that version 3 is part of that format.

In hindsight, ES819Version3TSDBDocValuesFormat wasn't needed (incorrect serverless upgrading made me think we had to introduce it), but it also doesn't have its own codec versioning scheme. This isn't ideal, but hopefully we can start with a new versioning scheme when ES94TSDBDocValuesFormat is introduced.

Member Author

yes, I pushed 6bbf4a8

@dnhatn
Member Author

dnhatn commented Mar 23, 2026

@kkrik-es @martijnvg Thank you for the reviews!

@dnhatn dnhatn merged commit 5658fce into elastic:main Mar 23, 2026
29 of 36 checks passed
@dnhatn dnhatn deleted the codec-write-tsid-prefixes branch March 23, 2026 20:17
salvatore-campagna added a commit to salvatore-campagna/elasticsearch that referenced this pull request Mar 24, 2026
Move prefix partition read/write logic into abstract base classes:
- Add PrefixPartitionedEntry, PartitionedDocValues on BaseSortedDocValues
- Update readSorted/readSortedSet for version-gated partition metadata
- Update doAddSortedField/addTermsDict for PrefixedPartitionsWriter
- Add writePrefixPartitions to TSDBDocValuesFormatConfig
- Make PrefixedPartitionsReader/Writer public for cross-package access
- Fix test type references (ES819BinaryDocValues -> TSDBBinaryDocValues)
dnhatn added a commit that referenced this pull request Mar 26, 2026
This change wires the prefix partitions introduced in #144617 to the 
compute engine.

Today, we partition the rate query by interval via replacing round_to 
with query_and_tags. With 10k time-series and a 5-minute bucket, each
interval query reads all 10k time-series from every segment. In the rate
aggregation, we buffer data points for all 10k time-series and maintain
a priority queue across all of them within each interval. This approach
increases concurrency to avoid underutilizing CPUs, but adds overhead
and is not I/O friendly due to fragmented reads.

With prefix partitions, we partition data by groups of contiguous 
time-series instead. For example, 10k time-series can be split into 1024
groups of ~10 each. Each group reads all matching data points, and
because these time-series are co-located in each segment, reads are
sequential and I/O friendly. In the rate aggregation, the priority queue
manages only ~10 time-series per group instead of 10k, significantly
reducing memory usage. To avoid excessive overhead from tiny partitions,
we merge adjacent partitions up to a target size (250k docs).

When prefix partitioning is not available (e.g., an older codec without the prefix layout), we fall back to the current behavior.
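The merge step described above can be sketched as follows. This is hypothetical code, not the actual compute-engine implementation: the class and method names are invented, and only the 250k-doc target comes from the text. Given each partition's starting document, adjacent partitions are coalesced until a slice reaches the target size:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of merging adjacent prefix partitions into
// slices of roughly a target document count (250k in the text).
// starts[i] is the first doc of partition i; maxDoc is the segment's
// total doc count. Returns [start, end) doc ranges, one per slice.
public final class PartitionMerger {
    static List<int[]> mergeToTarget(int[] starts, int maxDoc, int targetDocs) {
        List<int[]> slices = new ArrayList<>();
        int sliceStart = starts[0];
        for (int i = 0; i < starts.length; i++) {
            // End of partition i is the start of partition i+1 (or maxDoc for the last one).
            int end = (i + 1 < starts.length) ? starts[i + 1] : maxDoc;
            if (end - sliceStart >= targetDocs) {  // slice is big enough: cut here
                slices.add(new int[] { sliceStart, end });
                sliceStart = end;
            }
        }
        if (sliceStart < maxDoc) {                 // leftover tail becomes the final slice
            slices.add(new int[] { sliceStart, maxDoc });
        }
        return slices;
    }
}
```

Because partitions are contiguous runs of sorted _tsid values, merging neighbors keeps each slice a single sequential doc range, preserving the I/O-friendly reads while avoiding per-slice overhead from tiny partitions.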
seanzatzdev pushed a commit to seanzatzdev/elasticsearch that referenced this pull request Mar 26, 2026
seanzatzdev pushed a commit to seanzatzdev/elasticsearch that referenced this pull request Mar 27, 2026
mamazzol pushed a commit to mamazzol/elasticsearch that referenced this pull request Mar 30, 2026