Partition rate query using tsid prefixes#144818

Merged
dnhatn merged 11 commits into elastic:main from dnhatn:query-tsid-prefix-partitions
Mar 26, 2026
Conversation

@dnhatn
Member

@dnhatn dnhatn commented Mar 24, 2026

This change wires the prefix partitions introduced in #144617 to the compute engine.

Today, we partition the rate query by interval, replacing round_to with query_and_tags. With 10k time-series and a 5-minute bucket, each interval query reads all 10k time-series from every segment. In the rate aggregation, we buffer data points for all 10k time-series and maintain a priority queue across all of them within each interval. This approach increases concurrency to keep CPUs busy, but it adds overhead and is not I/O friendly because reads are fragmented.

With prefix partitions, we partition data by groups of contiguous time-series instead. For example, 10k time-series can be split into 1024 groups of ~10 each. Each group reads all matching data points, and because these time-series are co-located in each segment, reads are sequential and I/O friendly. In the rate aggregation, the priority queue manages only ~10 time-series per group instead of 10k, significantly reducing memory usage. To avoid excessive overhead from tiny partitions, we merge adjacent partitions up to a target size (250k docs).

When prefix partitioning is not available (e.g., older codec without prefix layout), we fall back to the current behavior.
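The merging step described above (combining adjacent tiny partitions up to a target of 250k docs) can be sketched as follows. This is a minimal illustration, not the actual Elasticsearch code; the `Partition` record, the `PartitionMerger` class, and the greedy strategy are assumptions based on the description in this PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: greedily merge adjacent tsid-prefix partitions so
// each resulting slice holds roughly up to TARGET_DOCS documents. Names
// are illustrative and do not match the real Elasticsearch API.
class PartitionMerger {
    record Partition(int docCount) {}

    static final int TARGET_DOCS = 250_000;

    static List<List<Partition>> merge(List<Partition> partitions) {
        List<List<Partition>> merged = new ArrayList<>();
        List<Partition> current = new ArrayList<>();
        int docs = 0;
        for (Partition p : partitions) {
            // Start a new slice when adding this partition would exceed
            // the target; adjacency is preserved because we only combine
            // neighbors, keeping reads sequential within a slice.
            if (docs > 0 && docs + p.docCount() > TARGET_DOCS) {
                merged.add(current);
                current = new ArrayList<>();
                docs = 0;
            }
            current.add(p);
            docs += p.docCount();
        }
        if (!current.isEmpty()) {
            merged.add(current); // flush the trailing slice so none is dropped
        }
        return merged;
    }
}
```

Note the final flush of `current`: forgetting it silently drops the last slice, which is exactly the kind of dropped-slice bug discussed later in this thread.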

@dnhatn dnhatn added :StorageEngine/ES|QL Timeseries / metrics / logsdb capabilities in ES|QL >non-issue labels Mar 24, 2026
@dnhatn dnhatn requested a review from kkrik-es March 24, 2026 03:56
}
}

List<List<PartialLeafReaderContext>> partition(List<LeafReaderContext> leaves, int docsPerSlice) throws IOException {
Member Author


This is the main change.

@dnhatn dnhatn marked this pull request as ready for review March 24, 2026 03:58
@elasticsearchmachine
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@elastic elastic deleted a comment from elasticmachine Mar 24, 2026
@dnhatn
Member Author

dnhatn commented Mar 24, 2026

@kkrik-es I think there is a bug in the partition-combining logic that can drop some slices; I didn't spot it for a while (the tests didn't catch it). I think the win should be much smaller (and more realistic). I am running the benchmark again.

@dnhatn
Member Author

dnhatn commented Mar 24, 2026

Buildkite benchmark this with tsdb-metricsgen-270m please

Member

@kkrik-es kkrik-es left a comment


Well done, Nhat!

@kkrik-es
Member

Hmm, the results show very modest wins. Did the change apply?

@dnhatn
Member Author

dnhatn commented Mar 25, 2026

Buildkite benchmark this with tsdb-metricsgen-270m please

@elasticmachine
Collaborator

elasticmachine commented Mar 25, 2026

💚 Build Succeeded

This build ran two tsdb-metricsgen-270m benchmarks to evaluate performance impact of this PR.

History

@dnhatn
Member Author

dnhatn commented Mar 26, 2026

Thanks Kostas!

@dnhatn dnhatn merged commit 17b16e3 into elastic:main Mar 26, 2026
35 of 36 checks passed
@dnhatn dnhatn deleted the query-tsid-prefix-partitions branch March 26, 2026 04:01
seanzatzdev pushed a commit to seanzatzdev/elasticsearch that referenced this pull request Mar 26, 2026
seanzatzdev pushed a commit to seanzatzdev/elasticsearch that referenced this pull request Mar 27, 2026
mamazzol pushed a commit to mamazzol/elasticsearch that referenced this pull request Mar 30, 2026

Labels

>non-issue :StorageEngine/ES|QL Timeseries / metrics / logsdb capabilities in ES|QL Team:StorageEngine v9.4.0

4 participants