Skip to content

Implement fused TopNAggregation operator for GROUP BY ... ORDER BY aggregate LIMIT K#98607

Open
murphy-4o wants to merge 47 commits intoClickHouse:masterfrom
murphy-4o:murphy_issue_75098
Open

Implement fused TopNAggregation operator for GROUP BY ... ORDER BY aggregate LIMIT K#98607
murphy-4o wants to merge 47 commits intoClickHouse:masterfrom
murphy-4o:murphy_issue_75098

Conversation

@murphy-4o
Copy link
Copy Markdown
Member

@murphy-4o murphy-4o commented Mar 3, 2026

Summary

This PR introduces a fused TopNAggregating query plan operator for queries like:

SELECT key, max(col) AS m
FROM table
GROUP BY key
ORDER BY m DESC
LIMIT K

It replaces the AggregatingStep -> SortingStep -> LimitStep chain with a single TopNAggregatingStep and supports two execution modes:

  • Mode 1 (sorted input / early termination): if table sort key matches the ORDER BY aggregate argument and ordering is compatible, read in order and stop after K unique groups.
  • Mode 2 (unsorted input / threshold pruning): aggregate directly with a dynamic K-th boundary, prune rows below boundary in transform, and optionally push boundary as __topKFilter to PREWHERE for storage-level skipping.

Closes #75098

Newly introduced settings

  • optimize_topn_aggregation (default 0): enables TopNAggregation rewrite.
  • topn_aggregation_pruning_level (default 2): controls Mode 2 pruning stack:
    • 0: direct compute only
    • 1: + in-transform threshold pruning
    • 2: + dynamic __topKFilter pushdown (when use_top_k_dynamic_filtering=1)
  • topn_aggregation_max_limit (default 1000): conservative Mode 2 gate; for larger LIMIT values optimizer falls back to standard aggregation/sort pipeline to avoid known large-K regressions.

Performance (short)

  • Mode 1: up to ~15x speedup on sorted input due to early termination.
  • Mode 2 (small K): typically ~2-5x in multi-thread and ~1.6-9.9x in single-thread cases.
  • Large K: can regress without gating; topn_aggregation_max_limit=1000 defaults to conservative fallback and avoids worst cases.

Changelog category:

  • Performance Improvement

Changelog entry:

Added optimize_topn_aggregation for fused GROUP BY ... ORDER BY aggregate LIMIT K execution, with topn_aggregation_pruning_level controls for Mode 2 pruning and topn_aggregation_max_limit for conservative large-K fallback.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Made with Cursor


Note

Medium Risk
Adds a new query-plan rewrite and execution operator (plus dynamic PREWHERE injection) that can change performance/correctness for eligible queries, though it is guarded by a new setting defaulting to off. Risk is mainly in plan-pattern matching and the new threshold-pruning path across nullable/collation edge cases (partly gated/disabled).

Overview
Adds an optional TopN aggregation optimization (optimize_topn_aggregation) that rewrites Limit -> Sorting -> Aggregating into a single TopNAggregatingStep for GROUP BY ... ORDER BY <aggregate> LIMIT K when all aggregates are compatible.

Extends IAggregateFunction with getTopKAggregateInfo() and implements it for min/max, any, and argMin/argMax to drive the rewrite and mode selection (sorted-input early termination vs unsorted threshold pruning with optional __topKFilter PREWHERE pushdown), controlled by new settings topn_aggregation_pruning_level and topn_aggregation_max_limit.

Updates __topKFilter to use inclusive comparisons, adds versioned/cached comparator building backed by a new TopKThresholdTracker version counter, exposes LimitStep::getOffset() for rewrite gating, and adds stateless correctness/stress tests covering positive and negative cases.

Written by Cursor Bugbot for commit b2e47a2. This will update automatically on new commits. Configure here.

…gregate LIMIT K

Add `TopNAggregatingStep` and `TopNAggregatingTransform` that fuse aggregation,
sorting, and limiting into a single pass for queries like:
  SELECT key, max(ts) AS m FROM t GROUP BY key ORDER BY m DESC LIMIT 5

The optimization replaces the Aggregating → Sorting → Limit plan chain with a
single `TopNAggregating` step. Two modes are supported:

- Mode 1 (sorted input): when the table's ORDER BY key matches the aggregate
  argument, the operator requests reverse reading from `ReadFromMergeTree` and
  stops after K unique groups — O(K) memory, minimal rows read.
- Mode 2 (unsorted input): processes all rows, aggregates per group, then
  partial-sorts to select the top K — eliminates the full sort.

Supported aggregates: `min`, `max`, `any`, `argMin`, `argMax`. Each declares its
compatibility via a new `getTopKAggregateInfo` virtual method on
`IAggregateFunction`. The optimization is gated by the
`optimize_topn_aggregation` setting (default off).

Made-with: Cursor
1. Block optimization when OFFSET > 0 — the optimizer now checks
   `LimitStep::getOffset` and bails out, preventing incorrect results
   for `LIMIT K OFFSET N` queries.

2. Eliminate per-row arena growth in key serialization — replaced
   `serializeValueIntoArena` (which allocated into the arena on every row)
   with a pure `SipHash`-128 key. Arena is now only used for aggregate
   state allocation, bounded by number of groups.

3. Preserve full `SortColumnDescription` — the optimizer now copies the
   original `nulls_direction` and `collator` from the `SortingStep`
   instead of rebuilding with defaults. Mode 2's `compareAt` uses the
   correct `nulls_direction` for NULL ordering.

4. Document mode-2 as intentional v1 simplification — the header comment
   now states that threshold-based pruning is not yet implemented.

5. Add test coverage for OFFSET (negative case) and NULLS FIRST/LAST
   (correctness comparison between optimized and unoptimized paths).

Made-with: Cursor
1. Collision-safe group keys: Replace hash-only `std::string` key with
   `UInt128` SipHash + exact column comparison via `unordered_multimap`.
   On hash match, compares actual key column values against stored groups
   in `result_columns`, eliminating the theoretical collision risk.

2. Collation-aware compare in Mode 2: `generateMode2` now checks for a
   `collator` in `SortColumnDescription` and uses `compareAtWithCollation`
   when present, ensuring correct ordering for string ORDER BY with
   explicit collation (e.g., COLLATE 'en').

3. Mode 1 safety for argMin/argMax/any: Add `output_ordered_by_sort_key`
   flag to `TopKAggregateInfo`. Only min/max set it to true (their output
   is monotonically related to the sort key). The optimizer now only
   enables Mode 1 (early termination with sorted input) when this flag is
   true, preventing incorrect results for argMin/argMax where the output
   column differs from the sort key.

4. Extended test coverage:
   - Collation-sensitive ordering (COLLATE 'en') optimized vs reference
   - Mode 1 EXPLAIN PLAN + correctness with small LIMIT
   - argMin/argMax ASC/DESC correctness parity
   - Tie-heavy dataset (100 groups, deterministic values)
   - Multi-aggregate (max + argMax) correctness
   - EXPLAIN PLAN for argMin optimization

Made-with: Cursor
The `.cpp` was out of sync with the `.h` header due to a file-write
tool failure in the previous session. The header declared the new
collision-safe API (`hashGroupKey`, `findGroupIndex`,
`unordered_multimap<UInt128, size_t, UInt128Hash>`), but the `.cpp`
still contained the old `serializeGroupKey` returning `std::string`
with hash-only identity.

This commit makes `.cpp` consistent with `.h`:

- `hashGroupKey`: returns `UInt128` SipHash digest of key columns.
- `findGroupIndex`: on hash match, performs exact column-by-column
  comparison against stored keys in `result_columns` via `compareAt`,
  eliminating the hash-collision correctness risk.
- Both `consumeMode1` and `consumeMode2` now use the new
  `unordered_multimap::emplace` / `equal_range` API.

Made-with: Cursor
@murphy-4o murphy-4o marked this pull request as draft March 3, 2026 12:10
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Mar 3, 2026

Workflow [PR], commit [0320b44]

Summary:

job_name test_name status info comment
Stateless tests (arm_asan_ubsan, flaky check) failure
03467_topn_aggregation FAIL cidb
03467_topn_aggregation FAIL cidb
03467_topn_aggregation FAIL cidb
03467_topn_aggregation FAIL cidb
03467_topn_aggregation FAIL cidb
Stateless tests (amd_asan_ubsan, flaky check) failure
03467_topn_aggregation FAIL cidb
03467_topn_aggregation FAIL cidb
03467_topn_aggregation FAIL cidb
03467_topn_aggregation FAIL cidb
03467_topn_aggregation FAIL cidb
Stateless tests (amd_tsan, flaky check) failure
03467_topn_aggregation FAIL cidb
03467_topn_aggregation FAIL cidb
03467_topn_aggregation FAIL cidb
03467_topn_aggregation FAIL cidb
03467_topn_aggregation FAIL cidb
Stateless tests (amd_msan, flaky check) failure
03467_topn_aggregation FAIL cidb
03467_topn_aggregation FAIL cidb
03467_topn_aggregation FAIL cidb
03467_topn_aggregation FAIL cidb
03467_topn_aggregation FAIL cidb
Stateless tests (amd_debug, flaky check) failure
03467_topn_aggregation FAIL cidb
03467_topn_aggregation FAIL cidb
03467_topn_aggregation FAIL cidb
03467_topn_aggregation FAIL cidb
03467_topn_aggregation FAIL cidb
Stateless tests (amd_debug, parallel) failure
03467_topn_aggregation FAIL cidb

AI Review

Summary

This PR introduces the TopNAggregating rewrite/operator for GROUP BY ... ORDER BY aggregate LIMIT K, plus related settings, docs, and tests. I reviewed the changed C++ optimizer/transform paths, aggregate metadata wiring, threshold/filter logic, and tests for correctness/safety/performance risks. I did not find any new high-confidence blocker or major issue that is not already covered by existing clickhouse-gh[bot] inline comments.

ClickHouse Rules
Item Status Notes
Deletion logging
Serialization versioning
Core-area scrutiny
No test removal
Experimental gate Feature is behind optimize_topn_aggregation (default off).
No magic constants
Backward compatibility New behavior is opt-in via settings; SettingsChangesHistory.cpp updated.
SettingsChangesHistory.cpp
PR metadata quality Changelog category/entry match the change; docs were added in this PR.
Safe rollout
Compilation time No obvious high-fan-out header bloat beyond lightweight metadata additions.
Final Verdict
  • Status: ✅ Approve

@clickhouse-gh clickhouse-gh bot added the pr-experimental Experimental Feature label Mar 3, 2026
Key changes:

1. Periodic threshold refresh: After each chunk, `refreshThresholdFromStates`
   calls `insertResultInto` for all groups' ORDER BY aggregate, uses
   `IColumn::getPermutation` with limit K to find the K-th best actual
   aggregate value, and sets a tight boundary. This raises the threshold
   from the ~1st percentile (first-seen values) to the ~99.99th percentile
   (actual group maxes), enabling 95-99% of subsequent rows to be skipped.
   Capped at `limit * 10000` groups to prevent O(N_groups) blowup.

2. Mode 2 gating: Only applies when a `__topKFilter` prewhere can be pushed
   into `ReadFromMergeTree` (numeric, non-nullable aggregate argument, no
   existing prewhere). Without prewhere, falls through to the standard
   Aggregator pipeline which is faster due to type-dispatched hashing.

3. Mode 1 gating fix: Correctly requires `output_ordered_by_sort_key` for
   Mode 1 (early termination), preventing incorrect results for `argMin`/
   `argMax` whose output has different sort order than the sort key.

Performance results on 10M rows, 100K groups:
- Mode 2 unsorted (uniform): 7ms vs 13ms baseline (1.8x faster, was 3.7x slower)
- Mode 2 skewed (50M rows): 9ms vs 26ms baseline (2.9x faster)
- Mode 2 memory: 4 MiB vs 271 MiB (67x less)
- High cardinality (5M+ groups): 4-5x slower (known limitation)
- Non-MergeTree: zero regression (falls through to standard pipeline)

Made-with: Cursor
@clickhouse-gh clickhouse-gh bot added pr-performance Pull request with some performance improvements and removed pr-experimental Experimental Feature labels Mar 4, 2026
…, comparator caching

Major changes to the TopNAggregation unsorted-input path (Mode 2):

- Replace old `consumeMode2` (find + insert, per-row threshold check from
  aggregate states, periodic `refreshThresholdFromStates`) with a simpler
  direct loop using `HashMap::emplace`, per-chunk threshold estimation from
  raw input data, and `TopKThresholdTracker::testAndSet`.
- Accumulate keys in dedicated `mode2_accumulated_keys` columns instead of
  reusing `result_columns`, avoiding confusion between Mode 1 and Mode 2.
- `generateMode2Partial` emits only local top-K groups as intermediate
  aggregate state columns (ColumnAggregateFunction), drastically reducing
  fan-in to the merge transform.
- Rewrite `TopNAggregatingMergeTransform`: pre-compute `agg_state_offsets`,
  `key_column_indices`, `agg_column_indices`; use `ArenaPtr` instead of
  value `Arena`; inline hash/create/destroy instead of separate methods.
- Simplify destructor to unconditionally sweep `group_states`.
- Remove unused `refreshThresholdFromStates`, `isBelowThreshold`,
  `order_agg_arg_col_idx`, `threshold_active` members.

Mode 1 improvements:
- Use `MergingSortedTransform` to merge N sorted streams instead of
  `pipeline.resize(1)`, preserving parallelism up to the merge point.
- Pass `order_arg_col_name` through `TopNAggregatingStep` for sort desc.

Optimizer (`optimizeTopNAggregation`):
- Add `group_by_matches_sorting_prefix` guard: skip Mode 2 when GROUP BY
  keys match a prefix of the table sorting key (standard Aggregator is
  faster in this case due to two-level hash table with sorted output).
- Add `unqualifyColumnName` helper for robust key matching.

`FunctionTopKFilter`:
- Cache the built comparator function + converted threshold, keyed by a
  monotonic version counter on `TopKThresholdTracker`. Rebuilds only when
  the threshold actually changes, avoiding redundant `build` + `convertFieldToType`.
- Fix comparator to use `lessOrEquals`/`greaterOrEquals` (was `less`/`greater`),
  so rows exactly equal to the threshold are kept (required for correctness
  when the boundary group's argument equals the threshold).

`TopKThresholdTracker`:
- Add `version` atomic counter, incremented on every `testAndSet` that
  actually updates the threshold. Enables lock-free staleness detection.

Made-with: Cursor
1. Collision-safe group identity: replace `HashMap<UInt128, size_t>`
   (SipHash-only key) with `HashMapWithSavedHash<std::string_view, size_t>`
   backed by `IColumn::serializeValueIntoArena`. This gives exact key
   comparison with automatic arena rollback for non-new keys via
   `SerializedKeyHolder`, eliminating the theoretical hash-collision
   correctness risk.

2. Safe threshold pruning: remove the per-chunk row-level threshold
   update from `consumeMode2`. Row-level K-th values can be stricter
   than the true group-level K-th aggregate, which would incorrectly
   filter rows needed by groups in the final top K. The threshold is
   now updated only from group-level aggregates in
   `generateMode2Partial`, which is provably safe (a partial's K-th
   group aggregate is always <= the global K-th).

3. Fix Mode 2 gating: remove the `group_by_matches_sorting_prefix`
   check that blocked Mode 2 even when Mode 1 was inapplicable (e.g.
   GROUP BY matches the sorting key but the ORDER BY aggregate argument
   is a different column). The `!sorted_input` guard already suffices.

4. Add documentation comments explaining why Mode 2 is intentionally
   not applied as a generic unsorted fallback, and noting that Mode 2
   memory is O(unique groups) without in-stream eviction.

5. Add stress test for large composite keys, skewed ties, and parallel
   Mode 2 threshold pruning.

Made-with: Cursor
The test uses `COLLATE 'en'` which requires ICU. The fast test
build is compiled without ICU, causing `SUPPORT_IS_DISABLED` error.

Made-with: Cursor

This comment was marked as resolved.

Add a new `topn_aggregation_pruning_level` (UInt64, 0-2) setting to
control Mode 2 optimizations independently:
  - Level 0: direct compute only (no threshold, no filter)
  - Level 1: + in-transform threshold pruning
  - Level 2: + dynamic `__topKFilter` prewhere pushdown

The dynamic filter (level 2) now respects `use_top_k_dynamic_filtering`,
which defaults to false, so it is no longer injected unconditionally.

Re-add in-transform threshold pruning (`refreshThresholdFromStates`,
`isBelowThreshold`) to `TopNAggregatingTransform` for level 1+ support.

Test changes:
- Add correctness tests for all three pruning levels on MergeTree table
- Add EXPLAIN tests verifying plan differences across levels
- Existing EXPLAIN prewhere test now explicitly sets
  `use_top_k_dynamic_filtering = 1`

Made-with: Cursor
The static analyzer cannot prove that `enable_threshold_pruning` being
true guarantees `order_arg_col` is non-null. Check the pointer directly
instead of the boolean flag to satisfy clang-analyzer-core.NonNullParamChecker.

Made-with: Cursor
- Setting description for `topn_aggregation_pruning_level` now explicitly
  states that level 2 requires `use_top_k_dynamic_filtering` to inject
  the prewhere, and that level 0 is not recommended (slower than baseline).

- Mode 2 now requires `pruning_level >= 1` to activate. Level 0 (direct
  compute without threshold) is known to regress vs the standard Aggregator
  pipeline, so it should not silently activate.

- EXPLAIN reference for level 0 updated: now shows the standard
  Aggregating + Sorting + Limit pipeline (no TopNAggregating).

Made-with: Cursor
Pass 0 (no limit) to `MergingSortedTransform` instead of the user's
LIMIT K. K refers to the number of distinct groups, not total rows.
With the previous code, if K consecutive rows belonged to the same group,
MergingSorted would stop after K rows and the downstream transform would
see fewer than K distinct groups.

The backpressure from `TopNAggregatingTransform::consumeMode1` calling
`finishConsume` is sufficient to stop reading once K groups are found.

Made-with: Cursor
Header: regroup members into logical sections (configuration, column
indices, state layout, per-group state, Mode 1/2 columns, threshold).

Implementation: extract shared helpers to eliminate duplication:
- `computeStateLayout`: state offset/alignment (was duplicated in both
  constructors)
- `findOrderByAggIndex`: ORDER BY aggregate lookup (was duplicated in
  both constructors and `initColumnIndices`)
- `sortAndLimitColumns`: sort-permute-limit pattern (was triplicated in
  `generateMode2`, `generateMode2Partial`, merge `generate`)
- `prepareArgColumnPtrs`: arg column pointer setup (was duplicated in
  `consumeMode1` and `consumeMode2`)

Also make `consumeMode1` use the shared aggregate state helpers
(`createAggregateStates`/`addRowToAggregateStates`/etc.) instead of
inline state management, and deduplicate the merge loop in
`TopNAggregatingMergeTransform::consume`.

Net: -20 lines, fewer copy-paste surfaces for future changes.
Made-with: Cursor
Avoid large-`LIMIT` regressions by adding a conservative `topn_aggregation_max_limit` gate, and add `EXPLAIN` coverage that verifies default fallback and explicit override behavior. Also reduce threshold refresh overhead in `TopNAggregatingTransform` with adaptive cadence and document `GroupState` intent for future per-group metadata.

Made-with: Cursor
Add `buildThresholdKeepMask` to `TopNAggregatingTransform` which uses
`IColumn::compareColumn` for vectorized threshold filtering in Mode 2
instead of per-row `isBelowThreshold` calls. This amortizes virtual
call overhead and enables SIMD-friendly comparison paths.

Add "Design tradeoffs" and "Future improvements" sections to
`topngroupby.md` covering: serialized-key HashMap vs type-dispatched
tables, threshold refresh cost and adaptive backoff, dynamic filter
effectiveness by data distribution, partial top-K correctness, and
planned improvements (type-dispatched tables, group eviction,
mixed-aggregate two-pass, distributed support, etc.).

Made-with: Cursor
@murphy-4o murphy-4o marked this pull request as ready for review March 5, 2026 09:53
The test uses EXPLAIN to verify query plan structure, which changes
when ParallelReplicas is enabled (adds MergingAggregated/Union/
ReadFromRemoteParallelReplicas nodes). This caused the reference
output mismatch in the ParallelReplicas stateless test CI job.

Made-with: Cursor
@murphy-4o
Copy link
Copy Markdown
Member Author

@cursor review

cursor[bot]

This comment was marked as resolved.

…lumns

The companion aggregate compatibility loop only checked
`determined_by_first_row_direction` but never verified that each
companion's determining argument column matches the ORDER BY
aggregate's argument column.  This allowed, e.g.,
`argMax(payload, val)` alongside `max(ts)` (val != ts) to be
incorrectly optimized: Mode 1 only processes the first row per group
sorted by `ts`, missing the row with max `val`; Mode 2 threshold
pruning on `ts` can skip rows carrying the true max `val`.

Add a check that every non-`any` companion aggregate's determining
argument (`argument_names.back()`) equals the ORDER BY aggregate's
determining argument.  `any()` (`INT_MAX`) remains exempt since it
accepts any row by definition.

Made-with: Cursor
…ion tracking

Merge master's `compareFields`-based comparison (collation + nullable support)
with the branch's version counter for threshold change detection in
`TopKThresholdTracker`. Keep master's `executeGeneral`/`executeVectorized`
split in `FunctionTopKFilter`.

Co-Authored-By: Claude <noreply@anthropic.com>
@murphy-4o murphy-4o force-pushed the murphy_issue_75098 branch from 8e47d2c to 3c11938 Compare March 21, 2026 12:15
@EmeraldShift
Copy link
Copy Markdown
Contributor

When topn_aggregation_pruning_level is 2, does this use the TopKThreshold-based skip index filtering? Or only the PREWHERE one?

…structor

The constructor was changed to accept `SortColumnDescription` (for
collation/nullable support) but the call site in
`optimizeTopNAggregation` still passed a bare `int sort_direction`.

Co-Authored-By: Claude <noreply@anthropic.com>
@murphy-4o
Copy link
Copy Markdown
Member Author

When topn_aggregation_pruning_level is 2, does this use the TopKThreshold-based skip index filtering? Or only the PREWHERE one?

At topn_aggregation_pruning_level = 2, it uses both. The PREWHERE (__topKFilter) and the TopKThresholdTracker-based skip index filtering are independent mechanisms. The skip index
filtering is applied as best as possible, but has limitations:

  • Requires a minmax skip index on the ORDER BY aggregate argument column
  • Column must be numeric (isValueRepresentedByNumber)
  • Column must be non-nullable
  • No collation support

When those conditions aren't met, only the PREWHERE filtering is active.

Ref: #99033

…sHistory`

The settings were registered under 26.3 but the current development
version is 26.4, causing `02995_new_settings_history` to fail.

ClickHouse#98607

Co-Authored-By: Claude <noreply@anthropic.com>
/// Mode 1 requires output_ordered_by_sort_key because early termination depends
/// on the aggregate result being monotonically ordered with the sort key. For example,
/// argMin(payload, ts) returns payload which has different ordering than ts.
if (read_from_mt && order_info.output_ordered_by_sort_key)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mode 1 is enabled based only on sort direction, but this path does not validate SQL ordering semantics beyond direction.

At this point we only check output_ordered_by_sort_key and call requestReadingInOrder(1, required_direction, limit_value), while the original SortingStep can also carry COLLATE and NULLS FIRST/LAST semantics (sort_desc[0].collator / nulls_direction).

If query ORDER BY uses collation (and potentially nullable semantics), physical MergeTree read order is not guaranteed to match that order, so early termination after K groups can return wrong results.

Please add a strict guard before enabling sorted_input: at minimum reject this optimization when sort_desc[0].collator is set; and for nullable argument columns, only enable when null ordering is proven equivalent to storage read order.

murphy-4o and others added 3 commits March 22, 2026 14:20
- Fix `03470_topn_aggregation_parallel_replicas` test failure in "old
  analyzer" CI config: move `allow_experimental_analyzer = 1` from
  subquery SETTINGS to session-level SET, because changing this setting
  inside a subquery is disallowed when the top-level value differs.

- Fix potential overflow in `maybeRefreshThreshold`: replace
  `groups > limit * 10000` with `groups / 10000 > limit` to avoid
  wrapping when `limit` is large.

ClickHouse#98607

Co-Authored-By: Claude <noreply@anthropic.com>
order_col.getPermutation(
direction, IColumn::PermutationSortStability::Unstable,
output_limit, sort_direction, perm);
return perm;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TopNAggregatingTransform uses sort_direction as nan_direction_hint in threshold ordering (getPermutation and compareColumn), but final output sorting uses sort_description.front().nulls_direction.

For floating-point aggregates with explicit NULLS FIRST/LAST (which controls NaN placement), pruning can evaluate rows with a different ordering than the final ORDER BY, so rows can be incorrectly dropped/kept before final sort.

Please use the same nan_direction_hint as final sorting (sort_description.front().nulls_direction) in threshold-related comparisons/permutations.


- [GROUP BY optimization](/sql-reference/statements/select/group-by#group-by-optimization-depending-on-table-sorting-key)
)", 0) \
DECLARE(Bool, optimize_topn_aggregation, false, R"(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR introduces new user-facing settings (optimize_topn_aggregation, topn_aggregation_pruning_level, topn_aggregation_max_limit) but the PR template still has documentation unchecked and there is no docs update in this PR.

For a new feature, please add user documentation (or explicitly include the docs text in the PR) and mark the docs checkbox accordingly.

UInt64 pruning_level = optimization_settings.topn_aggregation_pruning_level;
bool mode2_limit_ok = limit_value <= optimization_settings.topn_aggregation_max_limit;

if (!sorted_input && mode2_limit_ok && pruning_level >= 1
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

topn_aggregation_pruning_level=0 is documented as “direct compute only” (Settings.cpp), but this branch requires pruning_level >= 1, so Mode 2 never activates at level 0.

That makes the setting behavior inconsistent with user-facing docs and with the comment above (level 0 — direct compute only). Please either:

  1. allow Mode 2 at level 0 (without threshold/pruning), or
  2. update the setting docs/comments to state that level 0 disables Mode 2 entirely.

@nihalzp
Copy link
Copy Markdown
Member

nihalzp commented Mar 22, 2026

Briefly looking at the code, it seemed to me that mode1 and mode2 does not share too much logic in common (maybe a small base class for common code?). I think it will be more readable if we have two different transforms (with good descriptive names) instead. Then, we can dispatch them via TopNAggregatingStep; this seems natural to me because TopNAggregatingStep already knows the mode at construction.

@@ -0,0 +1,648 @@
-- Tags: no-random-settings, no-fasttest, no-parallel-replicas
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-- Tags: no-random-settings

I understand why we have this setting to make sure that EXPLAIN queries do not become flaky. However, random settings are quite useful in finding bugs.

That's why I propose that we can have two test file: this one and equivalent 03467_topn_aggregation_explain.sql. 03467_topn_aggregation.sql can have all the correctness tests. Then, for each test 03467_topn_aggregation_explain.sql will verify that TopNAggregatingTransform is actually used. We can keep no-random-settings only for the explain test.

no-fasttest

I think we can drop it. I see no reason to keep it.

@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Mar 22, 2026

LLVM Coverage Report

Metric Baseline Current Δ
Lines 83.90% 84.00% +0.10%
Functions 24.80% 24.70% -0.10%
Branches 76.50% 76.50% +0.00%

PR changed lines: PR changed-lines coverage: 84.40% (952/1128, 1 noise lines excluded)
Diff coverage report
Uncovered code

murphy-4o added 10 commits April 1, 2026 08:56
The monolithic `TopNAggregatingTransform` handled both sorted-input
early-termination (Mode 1) and hash-based direct aggregation (Mode 2)
with runtime dispatch.  These modes share little logic beyond column
index mapping and aggregate state lifecycle.

Split into three classes:

- `TopNAggregatingTransformBase` -- base class with shared infrastructure
  (column indices, key serialization, aggregate state helpers)
- `TopNSortedAggregatingTransform` -- Mode 1: sorted input, stops after
  K distinct groups
- `TopNDirectAggregatingTransform` -- Mode 2: hash aggregation with
  optional threshold pruning, supports partial output for parallel workers

`TopNAggregatingStep` now dispatches to the appropriate transform class
at construction time, which is natural since it already knows the mode.
`TopNAggregatingMergeTransform` is unchanged.

Made-with: Cursor
- Fix NaN direction hint: use `nulls_direction` from `sort_description`
  consistently in `getSortPermutation`, `isBelowThreshold`, and
  `buildThresholdKeepMask` instead of `sort_direction`, matching the
  final output sorting semantics.

- Add Mode 1 collation guard: reject sorted-input early termination
  when `sort_desc[0].collator` is set, because physical MergeTree
  order may not match collation order.

- Clarify `topn_aggregation_pruning_level` docs: level 0 disables
  Mode 2 entirely (not "direct compute only"), only Mode 1 can apply.

- Split `03467_topn_aggregation` test into correctness file (allows
  random settings) and EXPLAIN file (keeps `no-random-settings`);
  drop `no-fasttest` tag.

Made-with: Cursor
- Add `chassert` to verify column type consistency between input and
  boundary columns in `isBelowThreshold` and `buildThresholdKeepMask`.

- Guard `refreshThresholdFromStates` against `limit==0` to prevent
  unsigned underflow in `std::min(limit, n) - 1`.

- Add missing "max + any" reference queries in all three test files
  to verify companion aggregate correctness.

- Add negative EXPLAIN cases for `min + DESC` (wrong direction) and
  multi-column `ORDER BY` to improve optimizer coverage.

- Reset `allow_suspicious_low_cardinality_types` after table creation
  to avoid session pollution in test files.

- Fix misleading "ties" comment (aggregate values are actually distinct).

Made-with: Cursor
- Fix `TopKThresholdTracker::compareFields` to treat NaN the same as
  column-level code: NaN is ordered via `nulls_direction` instead of
  being treated as equal to all values by `Field::operator<`. This
  prevents the shared threshold from disagreeing with per-row pruning
  in `TopNDirectAggregatingTransform` and `FunctionTopKFilter`.

- Reject Mode 1 (sorted-input early termination) when the determining
  aggregate argument is nullable. MergeTree physical NULL ordering may
  differ from the query's `NULLS FIRST`/`NULLS LAST`, so early
  termination after K groups could return wrong results.

- Expand `optimize_topn_aggregation` setting description to document
  Mode 1 vs Mode 2 behavior, eligibility requirements, and current
  limitations (nullable, collation, float/NaN caveats).

- Add regression tests for Float64 NaN with explicit `NULLS FIRST`
  and `NULLS LAST` under Mode 2, and for nullable-primary-key tables
  under Mode 1.

Made-with: Cursor
Route floating-point types through the `executeGeneral` path in
`FunctionTopKFilter` instead of the vectorized `greaterOrEquals` /
`lessOrEquals` fast path. The vectorized comparisons return `false`
for NaN, which can incorrectly drop rows that the transform-level
pruning (using `compareColumn` with `nulls_direction`) would keep.

Add regression tests for Float64 NaN at `topn_aggregation_pruning_level = 2`
with `use_top_k_dynamic_filtering = 1`.

Made-with: Cursor
Document the TopN aggregation optimization in the GROUP BY reference
page: Mode 1 vs Mode 2 behavior, eligibility requirements, current
limitations, example query, and related settings.

Made-with: Cursor
The correctness test uses `COLLATE 'en'` which requires ICU, not
available in the fast test environment.

Made-with: Cursor
… columns

The flaky check with random settings (e.g. `max_block_size=1`) can produce
`ColumnConst` or other wrapped column types that aggregate functions cannot
handle via `assert_cast`. Use `convertToFullIfNeeded` (which handles Const,
Replicated, Sparse, and LowCardinality wrappers) instead of only converting
LowCardinality, matching what the standard `Aggregator` does.

Made-with: Cursor
- Unwrap key columns with `convertToFullIfNeeded` (like aggregate arg
  columns) so that `insertFrom` receives concrete column types instead
  of `ColumnConst`/`ColumnSparse` wrappers that cause `assert_cast`
  failures. Applied to Mode 1, Mode 2, and merge transform.

- Add try/catch rollback in `createAggregateStates` matching the
  standard `Aggregator` pattern: if `create` throws for aggregate j,
  destroy already-created states 0..j-1.

- Add try/catch in Mode 1 `consume` around temporary state usage so
  `destroyAggregateStates` is called even if `addRowToAggregateStates`
  or `insertResultsFromStates` throws.

- Move `++num_groups` before `addRowToAggregateStates` in Mode 2 so
  `accumulated_keys.size()` and `num_groups` stay consistent if `add`
  throws after `group_states.push_back`.

Made-with: Cursor
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-performance Pull request with some performance improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

"Top K" style query with simple aggregation function reads many rows and uses a lot of memory

7 participants