Implement fused TopNAggregation operator for GROUP BY ... ORDER BY aggregate LIMIT K#98607
Implement fused TopNAggregation operator for GROUP BY ... ORDER BY aggregate LIMIT K#98607murphy-4o wants to merge 47 commits intoClickHouse:masterfrom
Conversation
…gregate LIMIT K Add `TopNAggregatingStep` and `TopNAggregatingTransform` that fuse aggregation, sorting, and limiting into a single pass for queries like: SELECT key, max(ts) AS m FROM t GROUP BY key ORDER BY m DESC LIMIT 5 The optimization replaces the Aggregating → Sorting → Limit plan chain with a single `TopNAggregating` step. Two modes are supported: - Mode 1 (sorted input): when the table's ORDER BY key matches the aggregate argument, the operator requests reverse reading from `ReadFromMergeTree` and stops after K unique groups — O(K) memory, minimal rows read. - Mode 2 (unsorted input): processes all rows, aggregates per group, then partial-sorts to select the top K — eliminates the full sort. Supported aggregates: `min`, `max`, `any`, `argMin`, `argMax`. Each declares its compatibility via a new `getTopKAggregateInfo` virtual method on `IAggregateFunction`. The optimization is gated by the `optimize_topn_aggregation` setting (default off). Made-with: Cursor
1. Block optimization when OFFSET > 0 — the optimizer now checks `LimitStep::getOffset` and bails out, preventing incorrect results for `LIMIT K OFFSET N` queries. 2. Eliminate per-row arena growth in key serialization — replaced `serializeValueIntoArena` (which allocated into the arena on every row) with a pure `SipHash`-128 key. Arena is now only used for aggregate state allocation, bounded by number of groups. 3. Preserve full `SortColumnDescription` — the optimizer now copies the original `nulls_direction` and `collator` from the `SortingStep` instead of rebuilding with defaults. Mode 2's `compareAt` uses the correct `nulls_direction` for NULL ordering. 4. Document mode-2 as intentional v1 simplification — the header comment now states that threshold-based pruning is not yet implemented. 5. Add test coverage for OFFSET (negative case) and NULLS FIRST/LAST (correctness comparison between optimized and unoptimized paths). Made-with: Cursor
1. Collision-safe group keys: Replace hash-only `std::string` key with `UInt128` SipHash + exact column comparison via `unordered_multimap`. On hash match, compares actual key column values against stored groups in `result_columns`, eliminating the theoretical collision risk. 2. Collation-aware compare in Mode 2: `generateMode2` now checks for a `collator` in `SortColumnDescription` and uses `compareAtWithCollation` when present, ensuring correct ordering for string ORDER BY with explicit collation (e.g., COLLATE 'en'). 3. Mode 1 safety for argMin/argMax/any: Add `output_ordered_by_sort_key` flag to `TopKAggregateInfo`. Only min/max set it to true (their output is monotonically related to the sort key). The optimizer now only enables Mode 1 (early termination with sorted input) when this flag is true, preventing incorrect results for argMin/argMax where the output column differs from the sort key. 4. Extended test coverage: - Collation-sensitive ordering (COLLATE 'en') optimized vs reference - Mode 1 EXPLAIN PLAN + correctness with small LIMIT - argMin/argMax ASC/DESC correctness parity - Tie-heavy dataset (100 groups, deterministic values) - Multi-aggregate (max + argMax) correctness - EXPLAIN PLAN for argMin optimization Made-with: Cursor
The `.cpp` was out of sync with the `.h` header due to a file-write tool failure in the previous session. The header declared the new collision-safe API (`hashGroupKey`, `findGroupIndex`, `unordered_multimap<UInt128, size_t, UInt128Hash>`), but the `.cpp` still contained the old `serializeGroupKey` returning `std::string` with hash-only identity. This commit makes `.cpp` consistent with `.h`: - `hashGroupKey`: returns `UInt128` SipHash digest of key columns. - `findGroupIndex`: on hash match, performs exact column-by-column comparison against stored keys in `result_columns` via `compareAt`, eliminating the hash-collision correctness risk. - Both `consumeMode1` and `consumeMode2` now use the new `unordered_multimap::emplace` / `equal_range` API. Made-with: Cursor
|
Workflow [PR], commit [0320b44] Summary: ⏳
AI ReviewSummaryThis PR introduces the ClickHouse Rules
Final Verdict
|
Key changes: 1. Periodic threshold refresh: After each chunk, `refreshThresholdFromStates` calls `insertResultInto` for all groups' ORDER BY aggregate, uses `IColumn::getPermutation` with limit K to find the K-th best actual aggregate value, and sets a tight boundary. This raises the threshold from the ~1st percentile (first-seen values) to the ~99.99th percentile (actual group maxes), enabling 95-99% of subsequent rows to be skipped. Capped at `limit * 10000` groups to prevent O(N_groups) blowup. 2. Mode 2 gating: Only applies when a `__topKFilter` prewhere can be pushed into `ReadFromMergeTree` (numeric, non-nullable aggregate argument, no existing prewhere). Without prewhere, falls through to the standard Aggregator pipeline which is faster due to type-dispatched hashing. 3. Mode 1 gating fix: Correctly requires `output_ordered_by_sort_key` for Mode 1 (early termination), preventing incorrect results for `argMin`/ `argMax` whose output has different sort order than the sort key. Performance results on 10M rows, 100K groups: - Mode 2 unsorted (uniform): 7ms vs 13ms baseline (1.8x faster, was 3.7x slower) - Mode 2 skewed (50M rows): 9ms vs 26ms baseline (2.9x faster) - Mode 2 memory: 4 MiB vs 271 MiB (67x less) - High cardinality (5M+ groups): 4-5x slower (known limitation) - Non-MergeTree: zero regression (falls through to standard pipeline) Made-with: Cursor
…, comparator caching Major changes to the TopNAggregation unsorted-input path (Mode 2): - Replace old `consumeMode2` (find + insert, per-row threshold check from aggregate states, periodic `refreshThresholdFromStates`) with a simpler direct loop using `HashMap::emplace`, per-chunk threshold estimation from raw input data, and `TopKThresholdTracker::testAndSet`. - Accumulate keys in dedicated `mode2_accumulated_keys` columns instead of reusing `result_columns`, avoiding confusion between Mode 1 and Mode 2. - `generateMode2Partial` emits only local top-K groups as intermediate aggregate state columns (ColumnAggregateFunction), drastically reducing fan-in to the merge transform. - Rewrite `TopNAggregatingMergeTransform`: pre-compute `agg_state_offsets`, `key_column_indices`, `agg_column_indices`; use `ArenaPtr` instead of value `Arena`; inline hash/create/destroy instead of separate methods. - Simplify destructor to unconditionally sweep `group_states`. - Remove unused `refreshThresholdFromStates`, `isBelowThreshold`, `order_agg_arg_col_idx`, `threshold_active` members. Mode 1 improvements: - Use `MergingSortedTransform` to merge N sorted streams instead of `pipeline.resize(1)`, preserving parallelism up to the merge point. - Pass `order_arg_col_name` through `TopNAggregatingStep` for sort desc. Optimizer (`optimizeTopNAggregation`): - Add `group_by_matches_sorting_prefix` guard: skip Mode 2 when GROUP BY keys match a prefix of the table sorting key (standard Aggregator is faster in this case due to two-level hash table with sorted output). - Add `unqualifyColumnName` helper for robust key matching. `FunctionTopKFilter`: - Cache the built comparator function + converted threshold, keyed by a monotonic version counter on `TopKThresholdTracker`. Rebuilds only when the threshold actually changes, avoiding redundant `build` + `convertFieldToType`. - Fix comparator to use `lessOrEquals`/`greaterOrEquals` (was `less`/`greater`), so rows exactly equal to the threshold are kept (required for correctness when the boundary group's argument equals the threshold). `TopKThresholdTracker`: - Add `version` atomic counter, incremented on every `testAndSet` that actually updates the threshold. Enables lock-free staleness detection. Made-with: Cursor
Made-with: Cursor
1. Collision-safe group identity: replace `HashMap<UInt128, size_t>` (SipHash-only key) with `HashMapWithSavedHash<std::string_view, size_t>` backed by `IColumn::serializeValueIntoArena`. This gives exact key comparison with automatic arena rollback for non-new keys via `SerializedKeyHolder`, eliminating the theoretical hash-collision correctness risk. 2. Safe threshold pruning: remove the per-chunk row-level threshold update from `consumeMode2`. Row-level K-th values can be stricter than the true group-level K-th aggregate, which would incorrectly filter rows needed by groups in the final top K. The threshold is now updated only from group-level aggregates in `generateMode2Partial`, which is provably safe (a partial's K-th group aggregate is always <= the global K-th). 3. Fix Mode 2 gating: remove the `group_by_matches_sorting_prefix` check that blocked Mode 2 even when Mode 1 was inapplicable (e.g. GROUP BY matches the sorting key but the ORDER BY aggregate argument is a different column). The `!sorted_input` guard already suffices. 4. Add documentation comments explaining why Mode 2 is intentionally not applied as a generic unsorted fallback, and noting that Mode 2 memory is O(unique groups) without in-stream eviction. 5. Add stress test for large composite keys, skewed ties, and parallel Mode 2 threshold pruning. Made-with: Cursor
The test uses `COLLATE 'en'` which requires ICU. The fast test build is compiled without ICU, causing `SUPPORT_IS_DISABLED` error. Made-with: Cursor
Add a new `topn_aggregation_pruning_level` (UInt64, 0-2) setting to control Mode 2 optimizations independently: - Level 0: direct compute only (no threshold, no filter) - Level 1: + in-transform threshold pruning - Level 2: + dynamic `__topKFilter` prewhere pushdown The dynamic filter (level 2) now respects `use_top_k_dynamic_filtering`, which defaults to false, so it is no longer injected unconditionally. Re-add in-transform threshold pruning (`refreshThresholdFromStates`, `isBelowThreshold`) to `TopNAggregatingTransform` for level 1+ support. Test changes: - Add correctness tests for all three pruning levels on MergeTree table - Add EXPLAIN tests verifying plan differences across levels - Existing EXPLAIN prewhere test now explicitly sets `use_top_k_dynamic_filtering = 1` Made-with: Cursor
The static analyzer cannot prove that `enable_threshold_pruning` being true guarantees `order_arg_col` is non-null. Check the pointer directly instead of the boolean flag to satisfy clang-analyzer-core.NonNullParamChecker. Made-with: Cursor
- Setting description for `topn_aggregation_pruning_level` now explicitly states that level 2 requires `use_top_k_dynamic_filtering` to inject the prewhere, and that level 0 is not recommended (slower than baseline). - Mode 2 now requires `pruning_level >= 1` to activate. Level 0 (direct compute without threshold) is known to regress vs the standard Aggregator pipeline, so it should not silently activate. - EXPLAIN reference for level 0 updated: now shows the standard Aggregating + Sorting + Limit pipeline (no TopNAggregating). Made-with: Cursor
Pass 0 (no limit) to `MergingSortedTransform` instead of the user's LIMIT K. K refers to the number of distinct groups, not total rows. With the previous code, if K consecutive rows belonged to the same group, MergingSorted would stop after K rows and the downstream transform would see fewer than K distinct groups. The backpressure from `TopNAggregatingTransform::consumeMode1` calling `finishConsume` is sufficient to stop reading once K groups are found. Made-with: Cursor
Header: regroup members into logical sections (configuration, column indices, state layout, per-group state, Mode 1/2 columns, threshold). Implementation: extract shared helpers to eliminate duplication: - `computeStateLayout`: state offset/alignment (was duplicated in both constructors) - `findOrderByAggIndex`: ORDER BY aggregate lookup (was duplicated in both constructors and `initColumnIndices`) - `sortAndLimitColumns`: sort-permute-limit pattern (was triplicated in `generateMode2`, `generateMode2Partial`, merge `generate`) - `prepareArgColumnPtrs`: arg column pointer setup (was duplicated in `consumeMode1` and `consumeMode2`) Also make `consumeMode1` use the shared aggregate state helpers (`createAggregateStates`/`addRowToAggregateStates`/etc.) instead of inline state management, and deduplicate the merge loop in `TopNAggregatingMergeTransform::consume`. Net: -20 lines, fewer copy-paste surfaces for future changes. Made-with: Cursor
Avoid large-`LIMIT` regressions by adding a conservative `topn_aggregation_max_limit` gate, and add `EXPLAIN` coverage that verifies default fallback and explicit override behavior. Also reduce threshold refresh overhead in `TopNAggregatingTransform` with adaptive cadence and document `GroupState` intent for future per-group metadata. Made-with: Cursor
Add `buildThresholdKeepMask` to `TopNAggregatingTransform` which uses `IColumn::compareColumn` for vectorized threshold filtering in Mode 2 instead of per-row `isBelowThreshold` calls. This amortizes virtual call overhead and enables SIMD-friendly comparison paths. Add "Design tradeoffs" and "Future improvements" sections to `topngroupby.md` covering: serialized-key HashMap vs type-dispatched tables, threshold refresh cost and adaptive backoff, dynamic filter effectiveness by data distribution, partial top-K correctness, and planned improvements (type-dispatched tables, group eviction, mixed-aggregate two-pass, distributed support, etc.). Made-with: Cursor
Made-with: Cursor
The test uses EXPLAIN to verify query plan structure, which changes when ParallelReplicas is enabled (adds MergingAggregated/Union/ ReadFromRemoteParallelReplicas nodes). This caused the reference output mismatch in the ParallelReplicas stateless test CI job. Made-with: Cursor
|
@cursor review |
…lumns The companion aggregate compatibility loop only checked `determined_by_first_row_direction` but never verified that each companion's determining argument column matches the ORDER BY aggregate's argument column. This allowed, e.g., `argMax(payload, val)` alongside `max(ts)` (val != ts) to be incorrectly optimized: Mode 1 only processes the first row per group sorted by `ts`, missing the row with max `val`; Mode 2 threshold pruning on `ts` can skip rows carrying the true max `val`. Add a check that every non-`any` companion aggregate's determining argument (`argument_names.back()`) equals the ORDER BY aggregate's determining argument. `any()` (`INT_MAX`) remains exempt since it accepts any row by definition. Made-with: Cursor
…ion tracking Merge master's `compareFields`-based comparison (collation + nullable support) with the branch's version counter for threshold change detection in `TopKThresholdTracker`. Keep master's `executeGeneral`/`executeVectorized` split in `FunctionTopKFilter`. Co-Authored-By: Claude <noreply@anthropic.com>
8e47d2c to
3c11938
Compare
|
When |
…structor The constructor was changed to accept `SortColumnDescription` (for collation/nullable support) but the call site in `optimizeTopNAggregation` still passed a bare `int sort_direction`. Co-Authored-By: Claude <noreply@anthropic.com>
At topn_aggregation_pruning_level = 2, it uses both. The
When those conditions aren't met, only the PREWHERE filtering is active. Ref: #99033 |
…sHistory` The settings were registered under 26.3 but the current development version is 26.4, causing `02995_new_settings_history` to fail. ClickHouse#98607 Co-Authored-By: Claude <noreply@anthropic.com>
| /// Mode 1 requires output_ordered_by_sort_key because early termination depends | ||
| /// on the aggregate result being monotonically ordered with the sort key. For example, | ||
| /// argMin(payload, ts) returns payload which has different ordering than ts. | ||
| if (read_from_mt && order_info.output_ordered_by_sort_key) |
There was a problem hiding this comment.
Mode 1 is enabled based only on sort direction, but this path does not validate SQL ordering semantics beyond direction.
At this point we only check output_ordered_by_sort_key and call requestReadingInOrder(1, required_direction, limit_value), while the original SortingStep can also carry COLLATE and NULLS FIRST/LAST semantics (sort_desc[0].collator / nulls_direction).
If query ORDER BY uses collation (and potentially nullable semantics), physical MergeTree read order is not guaranteed to match that order, so early termination after K groups can return wrong results.
Please add a strict guard before enabling sorted_input: at minimum reject this optimization when sort_desc[0].collator is set; and for nullable argument columns, only enable when null ordering is proven equivalent to storage read order.
- Fix `03470_topn_aggregation_parallel_replicas` test failure in "old analyzer" CI config: move `allow_experimental_analyzer = 1` from subquery SETTINGS to session-level SET, because changing this setting inside a subquery is disallowed when the top-level value differs. - Fix potential overflow in `maybeRefreshThreshold`: replace `groups > limit * 10000` with `groups / 10000 > limit` to avoid wrapping when `limit` is large. ClickHouse#98607 Co-Authored-By: Claude <noreply@anthropic.com>
…into murphy_issue_75098
| order_col.getPermutation( | ||
| direction, IColumn::PermutationSortStability::Unstable, | ||
| output_limit, sort_direction, perm); | ||
| return perm; |
There was a problem hiding this comment.
TopNAggregatingTransform uses sort_direction as nan_direction_hint in threshold ordering (getPermutation and compareColumn), but final output sorting uses sort_description.front().nulls_direction.
For floating-point aggregates with explicit NULLS FIRST/LAST (which controls NaN placement), pruning can evaluate rows with a different ordering than the final ORDER BY, so rows can be incorrectly dropped/kept before final sort.
Please use the same nan_direction_hint as final sorting (sort_description.front().nulls_direction) in threshold-related comparisons/permutations.
|
|
||
| - [GROUP BY optimization](/sql-reference/statements/select/group-by#group-by-optimization-depending-on-table-sorting-key) | ||
| )", 0) \ | ||
| DECLARE(Bool, optimize_topn_aggregation, false, R"( |
There was a problem hiding this comment.
This PR introduces new user-facing settings (optimize_topn_aggregation, topn_aggregation_pruning_level, topn_aggregation_max_limit) but the PR template still has documentation unchecked and there is no docs update in this PR.
For a new feature, please add user documentation (or explicitly include the docs text in the PR) and mark the docs checkbox accordingly.
| UInt64 pruning_level = optimization_settings.topn_aggregation_pruning_level; | ||
| bool mode2_limit_ok = limit_value <= optimization_settings.topn_aggregation_max_limit; | ||
|
|
||
| if (!sorted_input && mode2_limit_ok && pruning_level >= 1 |
There was a problem hiding this comment.
topn_aggregation_pruning_level=0 is documented as “direct compute only” (Settings.cpp), but this branch requires pruning_level >= 1, so Mode 2 never activates at level 0.
That makes the setting behavior inconsistent with user-facing docs and with the comment above (level 0 — direct compute only). Please either:
- allow Mode 2 at level
0(without threshold/pruning), or - update the setting docs/comments to state that level
0disables Mode 2 entirely.
|
Briefly looking at the code, it seemed to me that mode1 and mode2 does not share too much logic in common (maybe a small base class for common code?). I think it will be more readable if we have two different transforms (with good descriptive names) instead. Then, we can dispatch them via |
| @@ -0,0 +1,648 @@ | |||
| -- Tags: no-random-settings, no-fasttest, no-parallel-replicas | |||
There was a problem hiding this comment.
-- Tags: no-random-settings
I understand why we have this setting to make sure that EXPLAIN queries do not become flaky. However, random settings are quite useful in finding bugs.
That's why I propose that we can have two test file: this one and equivalent 03467_topn_aggregation_explain.sql. 03467_topn_aggregation.sql can have all the correctness tests. Then, for each test 03467_topn_aggregation_explain.sql will verify that TopNAggregatingTransform is actually used. We can keep no-random-settings only for the explain test.
no-fasttest
I think we can drop it. I see no reason to keep it.
LLVM Coverage Report
PR changed lines: PR changed-lines coverage: 84.40% (952/1128, 1 noise lines excluded) |
The monolithic `TopNAggregatingTransform` handled both sorted-input early-termination (Mode 1) and hash-based direct aggregation (Mode 2) with runtime dispatch. These modes share little logic beyond column index mapping and aggregate state lifecycle. Split into three classes: - `TopNAggregatingTransformBase` -- base class with shared infrastructure (column indices, key serialization, aggregate state helpers) - `TopNSortedAggregatingTransform` -- Mode 1: sorted input, stops after K distinct groups - `TopNDirectAggregatingTransform` -- Mode 2: hash aggregation with optional threshold pruning, supports partial output for parallel workers `TopNAggregatingStep` now dispatches to the appropriate transform class at construction time, which is natural since it already knows the mode. `TopNAggregatingMergeTransform` is unchanged. Made-with: Cursor
- Fix NaN direction hint: use `nulls_direction` from `sort_description` consistently in `getSortPermutation`, `isBelowThreshold`, and `buildThresholdKeepMask` instead of `sort_direction`, matching the final output sorting semantics. - Add Mode 1 collation guard: reject sorted-input early termination when `sort_desc[0].collator` is set, because physical MergeTree order may not match collation order. - Clarify `topn_aggregation_pruning_level` docs: level 0 disables Mode 2 entirely (not "direct compute only"), only Mode 1 can apply. - Split `03467_topn_aggregation` test into correctness file (allows random settings) and EXPLAIN file (keeps `no-random-settings`); drop `no-fasttest` tag. Made-with: Cursor
- Add `chassert` to verify column type consistency between input and boundary columns in `isBelowThreshold` and `buildThresholdKeepMask`. - Guard `refreshThresholdFromStates` against `limit==0` to prevent unsigned underflow in `std::min(limit, n) - 1`. - Add missing "max + any" reference queries in all three test files to verify companion aggregate correctness. - Add negative EXPLAIN cases for `min + DESC` (wrong direction) and multi-column `ORDER BY` to improve optimizer coverage. - Reset `allow_suspicious_low_cardinality_types` after table creation to avoid session pollution in test files. - Fix misleading "ties" comment (aggregate values are actually distinct). Made-with: Cursor
- Fix `TopKThresholdTracker::compareFields` to treat NaN the same as column-level code: NaN is ordered via `nulls_direction` instead of being treated as equal to all values by `Field::operator<`. This prevents the shared threshold from disagreeing with per-row pruning in `TopNDirectAggregatingTransform` and `FunctionTopKFilter`. - Reject Mode 1 (sorted-input early termination) when the determining aggregate argument is nullable. MergeTree physical NULL ordering may differ from the query's `NULLS FIRST`/`NULLS LAST`, so early termination after K groups could return wrong results. - Expand `optimize_topn_aggregation` setting description to document Mode 1 vs Mode 2 behavior, eligibility requirements, and current limitations (nullable, collation, float/NaN caveats). - Add regression tests for Float64 NaN with explicit `NULLS FIRST` and `NULLS LAST` under Mode 2, and for nullable-primary-key tables under Mode 1. Made-with: Cursor
Route floating-point types through the `executeGeneral` path in `FunctionTopKFilter` instead of the vectorized `greaterOrEquals` / `lessOrEquals` fast path. The vectorized comparisons return `false` for NaN, which can incorrectly drop rows that the transform-level pruning (using `compareColumn` with `nulls_direction`) would keep. Add regression tests for Float64 NaN at `topn_aggregation_pruning_level = 2` with `use_top_k_dynamic_filtering = 1`. Made-with: Cursor
Document the TopN aggregation optimization in the GROUP BY reference page: Mode 1 vs Mode 2 behavior, eligibility requirements, current limitations, example query, and related settings. Made-with: Cursor
The correctness test uses `COLLATE 'en'` which requires ICU, not available in the fast test environment. Made-with: Cursor
… columns The flaky check with random settings (e.g. `max_block_size=1`) can produce `ColumnConst` or other wrapped column types that aggregate functions cannot handle via `assert_cast`. Use `convertToFullIfNeeded` (which handles Const, Replicated, Sparse, and LowCardinality wrappers) instead of only converting LowCardinality, matching what the standard `Aggregator` does. Made-with: Cursor
- Unwrap key columns with `convertToFullIfNeeded` (like aggregate arg columns) so that `insertFrom` receives concrete column types instead of `ColumnConst`/`ColumnSparse` wrappers that cause `assert_cast` failures. Applied to Mode 1, Mode 2, and merge transform. - Add try/catch rollback in `createAggregateStates` matching the standard `Aggregator` pattern: if `create` throws for aggregate j, destroy already-created states 0..j-1. - Add try/catch in Mode 1 `consume` around temporary state usage so `destroyAggregateStates` is called even if `addRowToAggregateStates` or `insertResultsFromStates` throws. - Move `++num_groups` before `addRowToAggregateStates` in Mode 2 so `accumulated_keys.size()` and `num_groups` stay consistent if `add` throws after `group_states.push_back`. Made-with: Cursor
Summary
This PR introduces a fused
TopNAggregatingquery plan operator for queries like:It replaces the
AggregatingStep -> SortingStep -> LimitStepchain with a singleTopNAggregatingStepand supports two execution modes:Kunique groups.__topKFiltertoPREWHEREfor storage-level skipping.Closes #75098
Newly introduced settings
optimize_topn_aggregation(default0): enablesTopNAggregationrewrite.topn_aggregation_pruning_level(default2): controls Mode 2 pruning stack:0: direct compute only1: + in-transform threshold pruning2: + dynamic__topKFilterpushdown (whenuse_top_k_dynamic_filtering=1)topn_aggregation_max_limit(default1000): conservative Mode 2 gate; for largerLIMITvalues optimizer falls back to standard aggregation/sort pipeline to avoid known large-Kregressions.Performance (short)
topn_aggregation_max_limit=1000defaults to conservative fallback and avoids worst cases.Changelog category:
Changelog entry:
Added
optimize_topn_aggregationfor fusedGROUP BY ... ORDER BY aggregate LIMIT Kexecution, withtopn_aggregation_pruning_levelcontrols for Mode 2 pruning andtopn_aggregation_max_limitfor conservative large-Kfallback.Documentation entry for user-facing changes
Made with Cursor
Note
Medium Risk
Adds a new query-plan rewrite and execution operator (plus dynamic PREWHERE injection) that can change performance/correctness for eligible queries, though it is guarded by a new setting defaulting to off. Risk is mainly in plan-pattern matching and the new threshold-pruning path across nullable/collation edge cases (partly gated/disabled).
Overview
Adds an optional TopN aggregation optimization (
optimize_topn_aggregation) that rewritesLimit -> Sorting -> Aggregatinginto a singleTopNAggregatingStepforGROUP BY ... ORDER BY <aggregate> LIMIT Kwhen all aggregates are compatible.Extends
IAggregateFunctionwithgetTopKAggregateInfo()and implements it formin/max,any, andargMin/argMaxto drive the rewrite and mode selection (sorted-input early termination vs unsorted threshold pruning with optional__topKFilterPREWHERE pushdown), controlled by new settingstopn_aggregation_pruning_levelandtopn_aggregation_max_limit.Updates
__topKFilterto use inclusive comparisons, adds versioned/cached comparator building backed by a newTopKThresholdTrackerversion counter, exposesLimitStep::getOffset()for rewrite gating, and adds stateless correctness/stress tests covering positive and negative cases.Written by Cursor Bugbot for commit b2e47a2. This will update automatically on new commits. Configure here.