Partition rate query using tsid prefixes#144818

Merged
dnhatn merged 11 commits into elastic:main from dnhatn:query-tsid-prefix-partitions
Mar 26, 2026
Conversation

@dnhatn
Member

@dnhatn dnhatn commented Mar 24, 2026

This change wires the prefix partitions introduced in #144617 to the compute engine.

Today, we partition the rate query by interval, replacing round_to with query_and_tags. With 10k time-series and a 5-minute bucket, each interval query reads all 10k time-series from every segment. In the rate aggregation, we buffer data points for all 10k time-series and maintain a priority queue across all of them within each interval. This approach increases concurrency to keep CPUs busy, but it adds overhead and is not I/O friendly because reads are fragmented.

With prefix partitions, we partition data by groups of contiguous time-series instead. For example, 10k time-series can be split into 1024 groups of ~10 each. Each group reads all matching data points, and because these time-series are co-located in each segment, reads are sequential and I/O friendly. In the rate aggregation, the priority queue manages only ~10 time-series per group instead of 10k, significantly reducing memory usage. To avoid excessive overhead from tiny partitions, we merge adjacent partitions up to a target size (250k docs).

When prefix partitioning is not available (e.g., older codec without prefix layout), we fall back to the current behavior.
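The merging step described above (combining adjacent tiny partitions up to a target of 250k docs) can be sketched as follows. This is a minimal illustration, not the actual Elasticsearch code; the `Partition` record, the `PartitionMerger` class, and the greedy strategy are assumptions based on the description in this PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: greedily merge adjacent tsid-prefix partitions so
// each resulting slice holds roughly up to TARGET_DOCS documents. Names
// are illustrative and do not match the real Elasticsearch API.
class PartitionMerger {
    record Partition(int docCount) {}

    static final int TARGET_DOCS = 250_000;

    static List<List<Partition>> merge(List<Partition> partitions) {
        List<List<Partition>> merged = new ArrayList<>();
        List<Partition> current = new ArrayList<>();
        int docs = 0;
        for (Partition p : partitions) {
            // Start a new slice when adding this partition would exceed
            // the target; adjacency is preserved because we only combine
            // neighbors, keeping reads sequential within a slice.
            if (docs > 0 && docs + p.docCount() > TARGET_DOCS) {
                merged.add(current);
                current = new ArrayList<>();
                docs = 0;
            }
            current.add(p);
            docs += p.docCount();
        }
        if (!current.isEmpty()) {
            merged.add(current); // flush the trailing slice so none is dropped
        }
        return merged;
    }
}
```

Note the final flush of `current`: forgetting it silently drops the last slice, which is exactly the kind of dropped-slice bug discussed later in this thread.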

@dnhatn dnhatn added :StorageEngine/ES|QL Timeseries / metrics / logsdb capabilities in ES|QL >non-issue labels Mar 24, 2026
@dnhatn dnhatn requested a review from kkrik-es March 24, 2026 03:56
}
}

List<List<PartialLeafReaderContext>> partition(List<LeafReaderContext> leaves, int docsPerSlice) throws IOException {
Member Author


This is the main change.

@dnhatn dnhatn marked this pull request as ready for review March 24, 2026 03:58
@elasticsearchmachine
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@elastic elastic deleted a comment from elasticmachine Mar 24, 2026
@dnhatn
Member Author

dnhatn commented Mar 24, 2026

@kkrik-es I think there is a bug in the partition-combining logic that can drop some slices; I didn't spot it for a while (the tests didn't catch it). I think the win should be much smaller (and more realistic). I am running the benchmark again.

@dnhatn
Member Author

dnhatn commented Mar 24, 2026

Buildkite benchmark this with tsdb-metricsgen-270m please

Member

@kkrik-es kkrik-es left a comment


Well done, Nhat!

@kkrik-es
Member

Hmm, the results show very modest wins. Did the change apply?

@dnhatn
Member Author

dnhatn commented Mar 25, 2026

Buildkite benchmark this with tsdb-metricsgen-270m please

@elasticmachine
Collaborator

elasticmachine commented Mar 25, 2026

💚 Build Succeeded

This build ran two tsdb-metricsgen-270m benchmarks to evaluate performance impact of this PR.

History

@dnhatn
Member Author

dnhatn commented Mar 26, 2026

Thanks Kostas!

@dnhatn dnhatn merged commit 17b16e3 into elastic:main Mar 26, 2026
35 of 36 checks passed
@dnhatn dnhatn deleted the query-tsid-prefix-partitions branch March 26, 2026 04:01
seanzatzdev pushed a commit to seanzatzdev/elasticsearch that referenced this pull request Mar 26, 2026
seanzatzdev pushed a commit to seanzatzdev/elasticsearch that referenced this pull request Mar 27, 2026
mamazzol pushed a commit to mamazzol/elasticsearch that referenced this pull request Mar 30, 2026

Labels

>non-issue :StorageEngine/ES|QL Timeseries / metrics / logsdb capabilities in ES|QL Team:StorageEngine v9.4.0

4 participants