Partition rate query using tsid prefixes #144818
Merged
dnhatn merged 11 commits into elastic:main on Mar 26, 2026
Conversation
dnhatn
commented
Mar 24, 2026
Review context: List<List<PartialLeafReaderContext>> partition(List<LeafReaderContext> leaves, int docsPerSlice) throws IOException
Collaborator
Pinging @elastic/es-storage-engine (Team:StorageEngine)
Member
Author
@kkrik-es I think there is a bug in the partition-combining logic that can drop some slices; I didn't figure it out for a while (the tests didn't catch it). I think the win should be much smaller (and more realistic). I am running the benchmark again.
Member
Author
Buildkite benchmark this with tsdb-metricsgen-270m please
kkrik-es reviewed Mar 24, 2026
...ugin/esql/compute/src/main/java/org/elasticsearch/compute/lucene/query/LuceneSliceQueue.java (four review comments, resolved)
...ugin/esql/src/internalClusterTest/java/org/elasticsearch/xpack/esql/action/TimeSeriesIT.java (one review comment, resolved)
Member
Hm, the results show very modest wins. Did the change apply?
Member
Author
Buildkite benchmark this with tsdb-metricsgen-270m please
Collaborator
💚 Build Succeeded
This build ran two tsdb-metricsgen-270m benchmarks to evaluate the performance impact of this PR.
Member
Author
Thanks Kostas!
seanzatzdev pushed a commit to seanzatzdev/elasticsearch that referenced this pull request on Mar 26, 2026
seanzatzdev pushed a commit to seanzatzdev/elasticsearch that referenced this pull request on Mar 27, 2026
mamazzol pushed a commit to mamazzol/elasticsearch that referenced this pull request on Mar 30, 2026
This change wires the prefix partitions introduced in #144617 to the compute engine.
Today, we partition the rate query by interval, by replacing round_to with query_and_tags. With 10k time-series and a 5-minute bucket, each interval query reads all 10k time-series from every segment. In the rate aggregation, we buffer data points for all 10k time-series and maintain a priority queue across all of them within each interval. This approach increases concurrency to avoid underutilizing CPUs, but it adds overhead and is not I/O-friendly due to fragmented reads.
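As a rough illustration of the cost described above (a minimal sketch; the class, record, and numbers are illustrative, not the actual rate-aggregation code): with interval partitioning, a merge queue ordered by timestamp has to hold a buffered entry for every one of the 10k time-series at once.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Sketch of why per-interval partitioning is heavy: the rate aggregation
// buffers one data point per time-series and merges them by timestamp,
// so the priority queue scales with the total series count. All names
// here are illustrative, not the actual compute-engine code.
public class IntervalQueueSketch {
    record DataPoint(int tsid, long timestamp, double value) {}

    public static void main(String[] args) {
        int numSeries = 10_000;
        PriorityQueue<DataPoint> queue =
            new PriorityQueue<>(Comparator.comparingLong(DataPoint::timestamp));
        // One buffered point per series: with interval partitioning the
        // queue holds all 10k series; with prefix partitions of ~10
        // series each, it would hold only ~10.
        for (int tsid = 0; tsid < numSeries; tsid++) {
            queue.add(new DataPoint(tsid, 1000L + tsid % 300, 1.0));
        }
        System.out.println(queue.size()); // prints 10000
    }
}
```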
With prefix partitions, we partition data by groups of contiguous time-series instead. For example, 10k time-series can be split into 1024 groups of ~10 each. Each group reads all matching data points, and because these time-series are co-located in each segment, reads are sequential and I/O friendly. In the rate aggregation, the priority queue manages only ~10 time-series per group instead of 10k, significantly reducing memory usage. To avoid excessive overhead from tiny partitions, we merge adjacent partitions up to a target size (250k docs).
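The merge step above can be sketched as a greedy fold of adjacent partitions up to the target document count (a minimal illustration under assumptions from the description; the Partition record, the greedy strategy, and the method names are hypothetical, not the actual LuceneSliceQueue code):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of merging adjacent tsid-prefix partitions up to a target doc
// count, to avoid excessive overhead from tiny partitions. Illustrative
// only: the real logic lives in the ES|QL compute engine.
public class PrefixPartitionMerger {
    // A contiguous range of time-series ids with its total doc count.
    record Partition(int firstTsid, int lastTsid, long docCount) {}

    static final long TARGET_DOCS = 250_000; // target size from the PR description

    // Greedily fold each partition into the previous one while the
    // combined size stays at or below the target.
    static List<Partition> mergeAdjacent(List<Partition> input) {
        List<Partition> merged = new ArrayList<>();
        for (Partition p : input) {
            if (merged.isEmpty() == false) {
                Partition last = merged.get(merged.size() - 1);
                if (last.docCount() + p.docCount() <= TARGET_DOCS) {
                    merged.set(merged.size() - 1,
                        new Partition(last.firstTsid(), p.lastTsid(),
                            last.docCount() + p.docCount()));
                    continue;
                }
            }
            merged.add(p);
        }
        return merged;
    }

    public static void main(String[] args) {
        // 1024 tiny partitions of 1k docs each collapse into a handful
        // of partitions near the 250k target.
        List<Partition> tiny = new ArrayList<>();
        for (int i = 0; i < 1024; i++) {
            tiny.add(new Partition(i * 10, i * 10 + 9, 1_000));
        }
        System.out.println(mergeAdjacent(tiny).size()); // prints 5
    }
}
```

Because the partitions cover contiguous tsid ranges, merging adjacent ones preserves the co-location property that makes the reads sequential.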
When prefix partitioning is not available (e.g., an older codec without the prefix layout), we fall back to the current behavior.