Implement fused TopNAggregation operator for GROUP BY ... ORDER BY aggregate LIMIT K by murphy-4o · Pull Request #98607 · ClickHouse/ClickHouse

murphy-4o · 2026-03-03T12:10:06Z

Summary

This PR introduces a fused TopNAggregating query plan operator for queries like:

SELECT key, max(col) AS m
FROM table
GROUP BY key
ORDER BY m DESC
LIMIT K

It replaces the AggregatingStep -> SortingStep -> LimitStep chain with a single TopNAggregatingStep and supports two execution modes:

Mode 1 (sorted input / early termination): if table sort key matches the ORDER BY aggregate argument and ordering is compatible, read in order and stop after K unique groups.
Mode 2 (unsorted input / threshold pruning): aggregate directly with a dynamic K-th boundary, prune rows below boundary in transform, and optionally push boundary as __topKFilter to PREWHERE for storage-level skipping.

Closes #75098

Newly introduced settings

optimize_topn_aggregation (default 0): enables TopNAggregation rewrite.
topn_aggregation_pruning_level (default 2): controls Mode 2 pruning stack:
- 0: direct compute only
- 1: + in-transform threshold pruning
- 2: + dynamic __topKFilter pushdown (when use_top_k_dynamic_filtering=1)
topn_aggregation_max_limit (default 1000): conservative Mode 2 gate; for larger LIMIT values optimizer falls back to standard aggregation/sort pipeline to avoid known large-K regressions.

Performance (short)

Mode 1: up to ~15x speedup on sorted input due to early termination.
Mode 2 (small K): typically ~2-5x in multi-thread and ~1.6-9.9x in single-thread cases.
Large K: can regress without gating; topn_aggregation_max_limit=1000 defaults to conservative fallback and avoids worst cases.

Changelog category:

Performance Improvement

Changelog entry:

Added optimize_topn_aggregation for fused GROUP BY ... ORDER BY aggregate LIMIT K execution, with topn_aggregation_pruning_level controls for Mode 2 pruning and topn_aggregation_max_limit for conservative large-K fallback.

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

Made with Cursor

Note

Medium Risk
Adds a new query-plan rewrite and execution operator (plus dynamic PREWHERE injection) that can change performance/correctness for eligible queries, though it is guarded by a new setting defaulting to off. Risk is mainly in plan-pattern matching and the new threshold-pruning path across nullable/collation edge cases (partly gated/disabled).

Overview
Adds an optional TopN aggregation optimization (optimize_topn_aggregation) that rewrites Limit -> Sorting -> Aggregating into a single TopNAggregatingStep for GROUP BY ... ORDER BY <aggregate> LIMIT K when all aggregates are compatible.

Extends IAggregateFunction with getTopKAggregateInfo() and implements it for min/max, any, and argMin/argMax to drive the rewrite and mode selection (sorted-input early termination vs unsorted threshold pruning with optional __topKFilter PREWHERE pushdown), controlled by new settings topn_aggregation_pruning_level and topn_aggregation_max_limit.

Updates __topKFilter to use inclusive comparisons, adds versioned/cached comparator building backed by a new TopKThresholdTracker version counter, exposes LimitStep::getOffset() for rewrite gating, and adds stateless correctness/stress tests covering positive and negative cases.

^{Written by Cursor Bugbot for commit b2e47a2. This will update automatically on new commits. Configure here.}

…gregate LIMIT K Add `TopNAggregatingStep` and `TopNAggregatingTransform` that fuse aggregation, sorting, and limiting into a single pass for queries like: SELECT key, max(ts) AS m FROM t GROUP BY key ORDER BY m DESC LIMIT 5 The optimization replaces the Aggregating → Sorting → Limit plan chain with a single `TopNAggregating` step. Two modes are supported: - Mode 1 (sorted input): when the table's ORDER BY key matches the aggregate argument, the operator requests reverse reading from `ReadFromMergeTree` and stops after K unique groups — O(K) memory, minimal rows read. - Mode 2 (unsorted input): processes all rows, aggregates per group, then partial-sorts to select the top K — eliminates the full sort. Supported aggregates: `min`, `max`, `any`, `argMin`, `argMax`. Each declares its compatibility via a new `getTopKAggregateInfo` virtual method on `IAggregateFunction`. The optimization is gated by the `optimize_topn_aggregation` setting (default off). Made-with: Cursor

1. Block optimization when OFFSET > 0 — the optimizer now checks `LimitStep::getOffset` and bails out, preventing incorrect results for `LIMIT K OFFSET N` queries. 2. Eliminate per-row arena growth in key serialization — replaced `serializeValueIntoArena` (which allocated into the arena on every row) with a pure `SipHash`-128 key. Arena is now only used for aggregate state allocation, bounded by number of groups. 3. Preserve full `SortColumnDescription` — the optimizer now copies the original `nulls_direction` and `collator` from the `SortingStep` instead of rebuilding with defaults. Mode 2's `compareAt` uses the correct `nulls_direction` for NULL ordering. 4. Document mode-2 as intentional v1 simplification — the header comment now states that threshold-based pruning is not yet implemented. 5. Add test coverage for OFFSET (negative case) and NULLS FIRST/LAST (correctness comparison between optimized and unoptimized paths). Made-with: Cursor

1. Collision-safe group keys: Replace hash-only `std::string` key with `UInt128` SipHash + exact column comparison via `unordered_multimap`. On hash match, compares actual key column values against stored groups in `result_columns`, eliminating the theoretical collision risk. 2. Collation-aware compare in Mode 2: `generateMode2` now checks for a `collator` in `SortColumnDescription` and uses `compareAtWithCollation` when present, ensuring correct ordering for string ORDER BY with explicit collation (e.g., COLLATE 'en'). 3. Mode 1 safety for argMin/argMax/any: Add `output_ordered_by_sort_key` flag to `TopKAggregateInfo`. Only min/max set it to true (their output is monotonically related to the sort key). The optimizer now only enables Mode 1 (early termination with sorted input) when this flag is true, preventing incorrect results for argMin/argMax where the output column differs from the sort key. 4. Extended test coverage: - Collation-sensitive ordering (COLLATE 'en') optimized vs reference - Mode 1 EXPLAIN PLAN + correctness with small LIMIT - argMin/argMax ASC/DESC correctness parity - Tie-heavy dataset (100 groups, deterministic values) - Multi-aggregate (max + argMax) correctness - EXPLAIN PLAN for argMin optimization Made-with: Cursor

The `.cpp` was out of sync with the `.h` header due to a file-write tool failure in the previous session. The header declared the new collision-safe API (`hashGroupKey`, `findGroupIndex`, `unordered_multimap<UInt128, size_t, UInt128Hash>`), but the `.cpp` still contained the old `serializeGroupKey` returning `std::string` with hash-only identity. This commit makes `.cpp` consistent with `.h`: - `hashGroupKey`: returns `UInt128` SipHash digest of key columns. - `findGroupIndex`: on hash match, performs exact column-by-column comparison against stored keys in `result_columns` via `compareAt`, eliminating the hash-collision correctness risk. - Both `consumeMode1` and `consumeMode2` now use the new `unordered_multimap::emplace` / `equal_range` API. Made-with: Cursor

clickhouse-gh · 2026-03-03T12:10:45Z

Workflow [PR], commit [0320b44]

Summary: ⏳

job_name	test_name	status	info
Stateless tests (arm_asan_ubsan, flaky check)		failure
	03467_topn_aggregation	FAIL	cidb
	03467_topn_aggregation	FAIL	cidb
	03467_topn_aggregation	FAIL	cidb
	03467_topn_aggregation	FAIL	cidb
	03467_topn_aggregation	FAIL	cidb
Stateless tests (amd_asan_ubsan, flaky check)		failure
	03467_topn_aggregation	FAIL	cidb
	03467_topn_aggregation	FAIL	cidb
	03467_topn_aggregation	FAIL	cidb
	03467_topn_aggregation	FAIL	cidb
	03467_topn_aggregation	FAIL	cidb
Stateless tests (amd_tsan, flaky check)		failure
	03467_topn_aggregation	FAIL	cidb
	03467_topn_aggregation	FAIL	cidb
	03467_topn_aggregation	FAIL	cidb
	03467_topn_aggregation	FAIL	cidb
	03467_topn_aggregation	FAIL	cidb
Stateless tests (amd_msan, flaky check)		failure
	03467_topn_aggregation	FAIL	cidb
	03467_topn_aggregation	FAIL	cidb
	03467_topn_aggregation	FAIL	cidb
	03467_topn_aggregation	FAIL	cidb
	03467_topn_aggregation	FAIL	cidb
Stateless tests (amd_debug, flaky check)		failure
	03467_topn_aggregation	FAIL	cidb
	03467_topn_aggregation	FAIL	cidb
	03467_topn_aggregation	FAIL	cidb
	03467_topn_aggregation	FAIL	cidb
	03467_topn_aggregation	FAIL	cidb
Stateless tests (amd_debug, parallel)		failure
	03467_topn_aggregation	FAIL	cidb

AI Review

Summary

This PR introduces the TopNAggregating rewrite/operator for GROUP BY ... ORDER BY aggregate LIMIT K, plus related settings, docs, and tests. I reviewed the changed C++ optimizer/transform paths, aggregate metadata wiring, threshold/filter logic, and tests for correctness/safety/performance risks. I did not find any new high-confidence blocker or major issue that is not already covered by existing clickhouse-gh[bot] inline comments.

ClickHouse Rules

Item	Status	Notes
Deletion logging	➖
Serialization versioning	➖
Core-area scrutiny	✅
No test removal	✅
Experimental gate	✅	Feature is behind `optimize_topn_aggregation` (default off).
No magic constants	✅
Backward compatibility	✅	New behavior is opt-in via settings; `SettingsChangesHistory.cpp` updated.
`SettingsChangesHistory.cpp`	✅
PR metadata quality	✅	Changelog category/entry match the change; docs were added in this PR.
Safe rollout	✅
Compilation time	✅	No obvious high-fan-out header bloat beyond lightweight metadata additions.

Final Verdict

Status: ✅ Approve

Key changes: 1. Periodic threshold refresh: After each chunk, `refreshThresholdFromStates` calls `insertResultInto` for all groups' ORDER BY aggregate, uses `IColumn::getPermutation` with limit K to find the K-th best actual aggregate value, and sets a tight boundary. This raises the threshold from the ~1st percentile (first-seen values) to the ~99.99th percentile (actual group maxes), enabling 95-99% of subsequent rows to be skipped. Capped at `limit * 10000` groups to prevent O(N_groups) blowup. 2. Mode 2 gating: Only applies when a `__topKFilter` prewhere can be pushed into `ReadFromMergeTree` (numeric, non-nullable aggregate argument, no existing prewhere). Without prewhere, falls through to the standard Aggregator pipeline which is faster due to type-dispatched hashing. 3. Mode 1 gating fix: Correctly requires `output_ordered_by_sort_key` for Mode 1 (early termination), preventing incorrect results for `argMin`/ `argMax` whose output has different sort order than the sort key. Performance results on 10M rows, 100K groups: - Mode 2 unsorted (uniform): 7ms vs 13ms baseline (1.8x faster, was 3.7x slower) - Mode 2 skewed (50M rows): 9ms vs 26ms baseline (2.9x faster) - Mode 2 memory: 4 MiB vs 271 MiB (67x less) - High cardinality (5M+ groups): 4-5x slower (known limitation) - Non-MergeTree: zero regression (falls through to standard pipeline) Made-with: Cursor

…, comparator caching Major changes to the TopNAggregation unsorted-input path (Mode 2): - Replace old `consumeMode2` (find + insert, per-row threshold check from aggregate states, periodic `refreshThresholdFromStates`) with a simpler direct loop using `HashMap::emplace`, per-chunk threshold estimation from raw input data, and `TopKThresholdTracker::testAndSet`. - Accumulate keys in dedicated `mode2_accumulated_keys` columns instead of reusing `result_columns`, avoiding confusion between Mode 1 and Mode 2. - `generateMode2Partial` emits only local top-K groups as intermediate aggregate state columns (ColumnAggregateFunction), drastically reducing fan-in to the merge transform. - Rewrite `TopNAggregatingMergeTransform`: pre-compute `agg_state_offsets`, `key_column_indices`, `agg_column_indices`; use `ArenaPtr` instead of value `Arena`; inline hash/create/destroy instead of separate methods. - Simplify destructor to unconditionally sweep `group_states`. - Remove unused `refreshThresholdFromStates`, `isBelowThreshold`, `order_agg_arg_col_idx`, `threshold_active` members. Mode 1 improvements: - Use `MergingSortedTransform` to merge N sorted streams instead of `pipeline.resize(1)`, preserving parallelism up to the merge point. - Pass `order_arg_col_name` through `TopNAggregatingStep` for sort desc. Optimizer (`optimizeTopNAggregation`): - Add `group_by_matches_sorting_prefix` guard: skip Mode 2 when GROUP BY keys match a prefix of the table sorting key (standard Aggregator is faster in this case due to two-level hash table with sorted output). - Add `unqualifyColumnName` helper for robust key matching. `FunctionTopKFilter`: - Cache the built comparator function + converted threshold, keyed by a monotonic version counter on `TopKThresholdTracker`. Rebuilds only when the threshold actually changes, avoiding redundant `build` + `convertFieldToType`. - Fix comparator to use `lessOrEquals`/`greaterOrEquals` (was `less`/`greater`), so rows exactly equal to the threshold are kept (required for correctness when the boundary group's argument equals the threshold). `TopKThresholdTracker`: - Add `version` atomic counter, incremented on every `testAndSet` that actually updates the threshold. Enables lock-free staleness detection. Made-with: Cursor

Made-with: Cursor

1. Collision-safe group identity: replace `HashMap<UInt128, size_t>` (SipHash-only key) with `HashMapWithSavedHash<std::string_view, size_t>` backed by `IColumn::serializeValueIntoArena`. This gives exact key comparison with automatic arena rollback for non-new keys via `SerializedKeyHolder`, eliminating the theoretical hash-collision correctness risk. 2. Safe threshold pruning: remove the per-chunk row-level threshold update from `consumeMode2`. Row-level K-th values can be stricter than the true group-level K-th aggregate, which would incorrectly filter rows needed by groups in the final top K. The threshold is now updated only from group-level aggregates in `generateMode2Partial`, which is provably safe (a partial's K-th group aggregate is always <= the global K-th). 3. Fix Mode 2 gating: remove the `group_by_matches_sorting_prefix` check that blocked Mode 2 even when Mode 1 was inapplicable (e.g. GROUP BY matches the sorting key but the ORDER BY aggregate argument is a different column). The `!sorted_input` guard already suffices. 4. Add documentation comments explaining why Mode 2 is intentionally not applied as a generic unsorted fallback, and noting that Mode 2 memory is O(unique groups) without in-stream eviction. 5. Add stress test for large composite keys, skewed ties, and parallel Mode 2 threshold pruning. Made-with: Cursor

The test uses `COLLATE 'en'` which requires ICU. The fast test build is compiled without ICU, causing `SUPPORT_IS_DISABLED` error. Made-with: Cursor

Add a new `topn_aggregation_pruning_level` (UInt64, 0-2) setting to control Mode 2 optimizations independently: - Level 0: direct compute only (no threshold, no filter) - Level 1: + in-transform threshold pruning - Level 2: + dynamic `__topKFilter` prewhere pushdown The dynamic filter (level 2) now respects `use_top_k_dynamic_filtering`, which defaults to false, so it is no longer injected unconditionally. Re-add in-transform threshold pruning (`refreshThresholdFromStates`, `isBelowThreshold`) to `TopNAggregatingTransform` for level 1+ support. Test changes: - Add correctness tests for all three pruning levels on MergeTree table - Add EXPLAIN tests verifying plan differences across levels - Existing EXPLAIN prewhere test now explicitly sets `use_top_k_dynamic_filtering = 1` Made-with: Cursor

The static analyzer cannot prove that `enable_threshold_pruning` being true guarantees `order_arg_col` is non-null. Check the pointer directly instead of the boolean flag to satisfy clang-analyzer-core.NonNullParamChecker. Made-with: Cursor

- Setting description for `topn_aggregation_pruning_level` now explicitly states that level 2 requires `use_top_k_dynamic_filtering` to inject the prewhere, and that level 0 is not recommended (slower than baseline). - Mode 2 now requires `pruning_level >= 1` to activate. Level 0 (direct compute without threshold) is known to regress vs the standard Aggregator pipeline, so it should not silently activate. - EXPLAIN reference for level 0 updated: now shows the standard Aggregating + Sorting + Limit pipeline (no TopNAggregating). Made-with: Cursor

Pass 0 (no limit) to `MergingSortedTransform` instead of the user's LIMIT K. K refers to the number of distinct groups, not total rows. With the previous code, if K consecutive rows belonged to the same group, MergingSorted would stop after K rows and the downstream transform would see fewer than K distinct groups. The backpressure from `TopNAggregatingTransform::consumeMode1` calling `finishConsume` is sufficient to stop reading once K groups are found. Made-with: Cursor

Header: regroup members into logical sections (configuration, column indices, state layout, per-group state, Mode 1/2 columns, threshold). Implementation: extract shared helpers to eliminate duplication: - `computeStateLayout`: state offset/alignment (was duplicated in both constructors) - `findOrderByAggIndex`: ORDER BY aggregate lookup (was duplicated in both constructors and `initColumnIndices`) - `sortAndLimitColumns`: sort-permute-limit pattern (was triplicated in `generateMode2`, `generateMode2Partial`, merge `generate`) - `prepareArgColumnPtrs`: arg column pointer setup (was duplicated in `consumeMode1` and `consumeMode2`) Also make `consumeMode1` use the shared aggregate state helpers (`createAggregateStates`/`addRowToAggregateStates`/etc.) instead of inline state management, and deduplicate the merge loop in `TopNAggregatingMergeTransform::consume`. Net: -20 lines, fewer copy-paste surfaces for future changes. Made-with: Cursor

Avoid large-`LIMIT` regressions by adding a conservative `topn_aggregation_max_limit` gate, and add `EXPLAIN` coverage that verifies default fallback and explicit override behavior. Also reduce threshold refresh overhead in `TopNAggregatingTransform` with adaptive cadence and document `GroupState` intent for future per-group metadata. Made-with: Cursor

Add `buildThresholdKeepMask` to `TopNAggregatingTransform` which uses `IColumn::compareColumn` for vectorized threshold filtering in Mode 2 instead of per-row `isBelowThreshold` calls. This amortizes virtual call overhead and enables SIMD-friendly comparison paths. Add "Design tradeoffs" and "Future improvements" sections to `topngroupby.md` covering: serialized-key HashMap vs type-dispatched tables, threshold refresh cost and adaptive backoff, dynamic filter effectiveness by data distribution, partial top-K correctness, and planned improvements (type-dispatched tables, group eviction, mixed-aggregate two-pass, distributed support, etc.). Made-with: Cursor

Made-with: Cursor

The test uses EXPLAIN to verify query plan structure, which changes when ParallelReplicas is enabled (adds MergingAggregated/Union/ ReadFromRemoteParallelReplicas nodes). This caused the reference output mismatch in the ParallelReplicas stateless test CI job. Made-with: Cursor

murphy-4o · 2026-03-05T12:12:44Z

@cursor review

…lumns The companion aggregate compatibility loop only checked `determined_by_first_row_direction` but never verified that each companion's determining argument column matches the ORDER BY aggregate's argument column. This allowed, e.g., `argMax(payload, val)` alongside `max(ts)` (val != ts) to be incorrectly optimized: Mode 1 only processes the first row per group sorted by `ts`, missing the row with max `val`; Mode 2 threshold pruning on `ts` can skip rows carrying the true max `val`. Add a check that every non-`any` companion aggregate's determining argument (`argument_names.back()`) equals the ORDER BY aggregate's determining argument. `any()` (`INT_MAX`) remains exempt since it accepts any row by definition. Made-with: Cursor

…ion tracking Merge master's `compareFields`-based comparison (collation + nullable support) with the branch's version counter for threshold change detection in `TopKThresholdTracker`. Keep master's `executeGeneral`/`executeVectorized` split in `FunctionTopKFilter`. Co-Authored-By: Claude <noreply@anthropic.com>

EmeraldShift · 2026-03-21T12:21:44Z

When topn_aggregation_pruning_level is 2, does this use the TopKThreshold-based skip index filtering? Or only the PREWHERE one?

src/Processors/Transforms/TopNAggregatingTransform.cpp

…structor The constructor was changed to accept `SortColumnDescription` (for collation/nullable support) but the call site in `optimizeTopNAggregation` still passed a bare `int sort_direction`. Co-Authored-By: Claude <noreply@anthropic.com>

murphy-4o · 2026-03-21T12:49:56Z

When topn_aggregation_pruning_level is 2, does this use the TopKThreshold-based skip index filtering? Or only the PREWHERE one?

At topn_aggregation_pruning_level = 2, it uses both. The PREWHERE (__topKFilter) and the TopKThresholdTracker-based skip index filtering are independent mechanisms. The skip index
filtering is applied as best as possible, but has limitations:

Requires a minmax skip index on the ORDER BY aggregate argument column
Column must be numeric (isValueRepresentedByNumber)
Column must be non-nullable
No collation support

When those conditions aren't met, only the PREWHERE filtering is active.

Ref: #99033

…sHistory` The settings were registered under 26.3 but the current development version is 26.4, causing `02995_new_settings_history` to fail. ClickHouse#98607 Co-Authored-By: Claude <noreply@anthropic.com>

clickhouse-gh · 2026-03-21T13:00:56Z

src/Processors/QueryPlan/Optimizations/optimizeTopNAggregation.cpp

+    /// Mode 1 requires output_ordered_by_sort_key because early termination depends
+    /// on the aggregate result being monotonically ordered with the sort key. For example,
+    /// argMin(payload, ts) returns payload which has different ordering than ts.
+    if (read_from_mt && order_info.output_ordered_by_sort_key)


Mode 1 is enabled based only on sort direction, but this path does not validate SQL ordering semantics beyond direction.

At this point we only check output_ordered_by_sort_key and call requestReadingInOrder(1, required_direction, limit_value), while the original SortingStep can also carry COLLATE and NULLS FIRST/LAST semantics (sort_desc[0].collator / nulls_direction).

If query ORDER BY uses collation (and potentially nullable semantics), physical MergeTree read order is not guaranteed to match that order, so early termination after K groups can return wrong results.

Please add a strict guard before enabling sorted_input: at minimum reject this optimization when sort_desc[0].collator is set; and for nullable argument columns, only enable when null ordering is proven equivalent to storage read order.

- Fix `03470_topn_aggregation_parallel_replicas` test failure in "old analyzer" CI config: move `allow_experimental_analyzer = 1` from subquery SETTINGS to session-level SET, because changing this setting inside a subquery is disallowed when the top-level value differs. - Fix potential overflow in `maybeRefreshThreshold`: replace `groups > limit * 10000` with `groups / 10000 > limit` to avoid wrapping when `limit` is large. ClickHouse#98607 Co-Authored-By: Claude <noreply@anthropic.com>

…into murphy_issue_75098

clickhouse-gh · 2026-03-22T06:29:02Z

src/Processors/Transforms/TopNAggregatingTransform.cpp

+    order_col.getPermutation(
+        direction, IColumn::PermutationSortStability::Unstable,
+        output_limit, sort_direction, perm);
+    return perm;


TopNAggregatingTransform uses sort_direction as nan_direction_hint in threshold ordering (getPermutation and compareColumn), but final output sorting uses sort_description.front().nulls_direction.

For floating-point aggregates with explicit NULLS FIRST/LAST (which controls NaN placement), pruning can evaluate rows with a different ordering than the final ORDER BY, so rows can be incorrectly dropped/kept before final sort.

Please use the same nan_direction_hint as final sorting (sort_description.front().nulls_direction) in threshold-related comparisons/permutations.

clickhouse-gh · 2026-03-22T09:46:54Z

src/Core/Settings.cpp


 - [GROUP BY optimization](/sql-reference/statements/select/group-by#group-by-optimization-depending-on-table-sorting-key)
+)", 0) \
+    DECLARE(Bool, optimize_topn_aggregation, false, R"(


This PR introduces new user-facing settings (optimize_topn_aggregation, topn_aggregation_pruning_level, topn_aggregation_max_limit) but the PR template still has documentation unchecked and there is no docs update in this PR.

For a new feature, please add user documentation (or explicitly include the docs text in the PR) and mark the docs checkbox accordingly.

clickhouse-gh · 2026-03-22T19:24:10Z

src/Processors/QueryPlan/Optimizations/optimizeTopNAggregation.cpp

+    UInt64 pruning_level = optimization_settings.topn_aggregation_pruning_level;
+    bool mode2_limit_ok = limit_value <= optimization_settings.topn_aggregation_max_limit;
+
+    if (!sorted_input && mode2_limit_ok && pruning_level >= 1


topn_aggregation_pruning_level=0 is documented as “direct compute only” (Settings.cpp), but this branch requires pruning_level >= 1, so Mode 2 never activates at level 0.

That makes the setting behavior inconsistent with user-facing docs and with the comment above (level 0 — direct compute only). Please either:

allow Mode 2 at level 0 (without threshold/pruning), or

update the setting docs/comments to state that level 0 disables Mode 2 entirely.

nihalzp · 2026-03-22T19:55:53Z

Briefly looking at the code, it seemed to me that mode1 and mode2 does not share too much logic in common (maybe a small base class for common code?). I think it will be more readable if we have two different transforms (with good descriptive names) instead. Then, we can dispatch them via TopNAggregatingStep; this seems natural to me because TopNAggregatingStep already knows the mode at construction.

nihalzp · 2026-03-22T20:14:35Z

tests/queries/0_stateless/03467_topn_aggregation.sql

@@ -0,0 +1,648 @@
+-- Tags: no-random-settings, no-fasttest, no-parallel-replicas


-- Tags: no-random-settings

I understand why we have this setting to make sure that EXPLAIN queries do not become flaky. However, random settings are quite useful in finding bugs.

That's why I propose that we can have two test file: this one and equivalent 03467_topn_aggregation_explain.sql. 03467_topn_aggregation.sql can have all the correctness tests. Then, for each test 03467_topn_aggregation_explain.sql will verify that TopNAggregatingTransform is actually used. We can keep no-random-settings only for the explain test.

no-fasttest

I think we can drop it. I see no reason to keep it.

clickhouse-gh · 2026-03-22T22:08:44Z

LLVM Coverage Report

Metric	Baseline	Current	Δ
Lines	83.90%	84.00%	+0.10%
Functions	24.80%	24.70%	-0.10%
Branches	76.50%	76.50%	+0.00%

PR changed lines: PR changed-lines coverage: 84.40% (952/1128, 1 noise lines excluded)
Diff coverage report
Uncovered code

The monolithic `TopNAggregatingTransform` handled both sorted-input early-termination (Mode 1) and hash-based direct aggregation (Mode 2) with runtime dispatch. These modes share little logic beyond column index mapping and aggregate state lifecycle. Split into three classes: - `TopNAggregatingTransformBase` -- base class with shared infrastructure (column indices, key serialization, aggregate state helpers) - `TopNSortedAggregatingTransform` -- Mode 1: sorted input, stops after K distinct groups - `TopNDirectAggregatingTransform` -- Mode 2: hash aggregation with optional threshold pruning, supports partial output for parallel workers `TopNAggregatingStep` now dispatches to the appropriate transform class at construction time, which is natural since it already knows the mode. `TopNAggregatingMergeTransform` is unchanged. Made-with: Cursor

- Fix NaN direction hint: use `nulls_direction` from `sort_description` consistently in `getSortPermutation`, `isBelowThreshold`, and `buildThresholdKeepMask` instead of `sort_direction`, matching the final output sorting semantics. - Add Mode 1 collation guard: reject sorted-input early termination when `sort_desc[0].collator` is set, because physical MergeTree order may not match collation order. - Clarify `topn_aggregation_pruning_level` docs: level 0 disables Mode 2 entirely (not "direct compute only"), only Mode 1 can apply. - Split `03467_topn_aggregation` test into correctness file (allows random settings) and EXPLAIN file (keeps `no-random-settings`); drop `no-fasttest` tag. Made-with: Cursor

- Add `chassert` to verify column type consistency between input and boundary columns in `isBelowThreshold` and `buildThresholdKeepMask`. - Guard `refreshThresholdFromStates` against `limit==0` to prevent unsigned underflow in `std::min(limit, n) - 1`. - Add missing "max + any" reference queries in all three test files to verify companion aggregate correctness. - Add negative EXPLAIN cases for `min + DESC` (wrong direction) and multi-column `ORDER BY` to improve optimizer coverage. - Reset `allow_suspicious_low_cardinality_types` after table creation to avoid session pollution in test files. - Fix misleading "ties" comment (aggregate values are actually distinct). Made-with: Cursor

- Fix `TopKThresholdTracker::compareFields` to treat NaN the same as column-level code: NaN is ordered via `nulls_direction` instead of being treated as equal to all values by `Field::operator<`. This prevents the shared threshold from disagreeing with per-row pruning in `TopNDirectAggregatingTransform` and `FunctionTopKFilter`. - Reject Mode 1 (sorted-input early termination) when the determining aggregate argument is nullable. MergeTree physical NULL ordering may differ from the query's `NULLS FIRST`/`NULLS LAST`, so early termination after K groups could return wrong results. - Expand `optimize_topn_aggregation` setting description to document Mode 1 vs Mode 2 behavior, eligibility requirements, and current limitations (nullable, collation, float/NaN caveats). - Add regression tests for Float64 NaN with explicit `NULLS FIRST` and `NULLS LAST` under Mode 2, and for nullable-primary-key tables under Mode 1. Made-with: Cursor

Route floating-point types through the `executeGeneral` path in `FunctionTopKFilter` instead of the vectorized `greaterOrEquals` / `lessOrEquals` fast path. The vectorized comparisons return `false` for NaN, which can incorrectly drop rows that the transform-level pruning (using `compareColumn` with `nulls_direction`) would keep. Add regression tests for Float64 NaN at `topn_aggregation_pruning_level = 2` with `use_top_k_dynamic_filtering = 1`. Made-with: Cursor

Document the TopN aggregation optimization in the GROUP BY reference page: Mode 1 vs Mode 2 behavior, eligibility requirements, current limitations, example query, and related settings. Made-with: Cursor

The correctness test uses `COLLATE 'en'` which requires ICU, not available in the fast test environment. Made-with: Cursor

… columns The flaky check with random settings (e.g. `max_block_size=1`) can produce `ColumnConst` or other wrapped column types that aggregate functions cannot handle via `assert_cast`. Use `convertToFullIfNeeded` (which handles Const, Replicated, Sparse, and LowCardinality wrappers) instead of only converting LowCardinality, matching what the standard `Aggregator` does. Made-with: Cursor

- Unwrap key columns with `convertToFullIfNeeded` (like aggregate arg columns) so that `insertFrom` receives concrete column types instead of `ColumnConst`/`ColumnSparse` wrappers that cause `assert_cast` failures. Applied to Mode 1, Mode 2, and merge transform. - Add try/catch rollback in `createAggregateStates` matching the standard `Aggregator` pattern: if `create` throws for aggregate j, destroy already-created states 0..j-1. - Add try/catch in Mode 1 `consume` around temporary state usage so `destroyAggregateStates` is called even if `addRowToAggregateStates` or `insertResultsFromStates` throws. - Move `++num_groups` before `addRowToAggregateStates` in Mode 2 so `accumulated_keys.size()` and `num_groups` stay consistent if `add` throws after `group_states.push_back`. Made-with: Cursor

murphy-4o added 4 commits March 3, 2026 09:16

murphy-4o marked this pull request as draft March 3, 2026 12:10

clickhouse-gh bot added the pr-experimental Experimental Feature label Mar 3, 2026

clickhouse-gh bot added pr-performance Pull request with some performance improvements and removed pr-experimental Experimental Feature labels Mar 4, 2026

murphy-4o added 4 commits March 5, 2026 02:18

Revert .claude/learnings.md changes

3c206b3

Made-with: Cursor

Add no-fasttest tag to TopN aggregation test

f25cb46

The test uses `COLLATE 'en'` which requires ICU. The fast test build is compiled without ICU, causing `SUPPORT_IS_DISABLED` error. Made-with: Cursor

murphy-4o requested a review from Copilot March 5, 2026 04:43

Copilot started reviewing on behalf of murphy-4o March 5, 2026 04:44 View session

This comment was marked as resolved.

Sign in to view

murphy-4o added 8 commits March 5, 2026 05:55

Remove topngroupby.md from repository tracking

046006c

Made-with: Cursor

murphy-4o marked this pull request as ready for review March 5, 2026 09:53

This comment was marked as resolved.

Sign in to view

fastio mentioned this pull request Mar 16, 2026

Fix SIGSEGV in SortCursor JIT call under UBSan #99563

Closed

1 task

murphy-4o force-pushed the murphy_issue_75098 branch from 8e47d2c to 3c11938 Compare March 21, 2026 12:15

clickhouse-gh bot reviewed Mar 21, 2026

View reviewed changes

src/Processors/Transforms/TopNAggregatingTransform.cpp Outdated Show resolved Hide resolved

Move new TopN aggregation settings to version 26.4 in `SettingsChange…

dbbc0db

…sHistory` The settings were registered under 26.3 but the current development version is 26.4, causing `02995_new_settings_history` to fail. ClickHouse#98607 Co-Authored-By: Claude <noreply@anthropic.com>

clickhouse-gh bot reviewed Mar 21, 2026

View reviewed changes

murphy-4o and others added 3 commits March 22, 2026 14:20

Merge branch 'master' into murphy_issue_75098

801da69

Merge branch 'murphy_issue_75098' of github.com:murphy-4o/ClickHouse …

4ab1962

…into murphy_issue_75098

clickhouse-gh bot reviewed Mar 22, 2026

View reviewed changes

Merge branch 'master' into murphy_issue_75098

b2eb915

clickhouse-gh bot reviewed Mar 22, 2026

View reviewed changes

Merge branch 'master' into murphy_issue_75098

9413f22

clickhouse-gh bot reviewed Mar 22, 2026

View reviewed changes

nihalzp reviewed Mar 22, 2026

View reviewed changes

murphy-4o added 10 commits April 1, 2026 08:56

Merge branch 'master' into murphy_issue_75098

d4d8cb8

Add user documentation for TopN aggregation optimization

11bb4b9

Document the TopN aggregation optimization in the GROUP BY reference page: Mode 1 vs Mode 2 behavior, eligibility requirements, current limitations, example query, and related settings. Made-with: Cursor

Fix CI: add TopN/topn to aspell dictionary, restore no-fasttest tag

5fb3573

The correctness test uses `COLLATE 'en'` which requires ICU, not available in the fast test environment. Made-with: Cursor

		@@ -0,0 +1,648 @@
		-- Tags: no-random-settings, no-fasttest, no-parallel-replicas

Conversation

murphy-4o commented Mar 3, 2026 • edited by cursor bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changelog category:

Changelog entry:

Documentation entry for user-facing changes

Uh oh!

clickhouse-gh bot commented Mar 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

ClickHouse Rules

Final Verdict

Uh oh!

This comment was marked as resolved.

Uh oh!

murphy-4o commented Mar 5, 2026

Uh oh!

This comment was marked as resolved.

Uh oh!

EmeraldShift commented Mar 21, 2026

Uh oh!

Uh oh!

murphy-4o commented Mar 21, 2026

Uh oh!

clickhouse-gh bot Mar 21, 2026

Choose a reason for hiding this comment

Uh oh!

clickhouse-gh bot Mar 22, 2026

Choose a reason for hiding this comment

Uh oh!

clickhouse-gh bot Mar 22, 2026

Choose a reason for hiding this comment

Uh oh!

clickhouse-gh bot Mar 22, 2026

Choose a reason for hiding this comment

Uh oh!

nihalzp commented Mar 22, 2026

Uh oh!

nihalzp Mar 22, 2026

Choose a reason for hiding this comment

Uh oh!

clickhouse-gh bot commented Mar 22, 2026

LLVM Coverage Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

murphy-4o commented Mar 3, 2026 •

edited by cursor bot

Loading

clickhouse-gh bot commented Mar 3, 2026 •

edited

Loading