Skip to content

Support non-numeric & nullable types for FunctionTopKFilter#99033

Merged
murphy-4o merged 13 commits intoClickHouse:masterfrom
murphy-4o:murphy_issue_99024
Mar 20, 2026
Merged

Support non-numeric & nullable types for FunctionTopKFilter#99033
murphy-4o merged 13 commits intoClickHouse:masterfrom
murphy-4o:murphy_issue_99024

Conversation

@murphy-4o
Copy link
Copy Markdown
Member

@murphy-4o murphy-4o commented Mar 9, 2026

Closes #99024

The ORDER BY ... LIMIT top-k dynamic filtering optimization (use_top_k_dynamic_filtering) was previously restricted to non-nullable numeric types. This PR generalizes FunctionTopKFilter and TopKThresholdTracker to support:

  • Nullable(T) with correct NULLS FIRST / NULLS LAST semantics
  • String and Nullable(String)
  • COLLATE-aware string ordering

TODO
The skip-index top-k path (use_skip_indexes_for_top_k) remains restricted to non-nullable numeric types because its MinMaxGranuleItem::operator< / getTopKMarks use raw Field comparison that does not respect nulls_direction or collation (left as a TODO).

Perf Result

  • Dataset: hits_nullable_full (39M rows, MergeTree ORDER BY CounterID)
  • Query pattern: SELECT col FROM hits_nullable_full ORDER BY col LIMIT 100
# Type OFF (s) ON (s) Speedup
1 UInt64 WatchID 0.009 0.009 1.00x
2 String URL 0.082 0.069 1.19x
3 String Title 0.074 0.072 1.03x
4 Nullable(UInt64) WatchID_N 0.035 0.014 2.50x
5 Nullable(String) URL_N 0.239 0.076 3.14x
6 Nullable(String) Title_N 0.203 0.079 2.57x

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Generalize ORDER BY ... LIMIT top-k dynamic filtering to support Nullable, String, and COLLATE types.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

…ation types

`FunctionTopKFilter` and `TopKThresholdTracker` previously only supported
numeric non-nullable types, using simplified direction-only comparisons
(`less`/`greater`). This limited `use_top_k_dynamic_filtering` to a small
subset of ORDER BY ... LIMIT queries.

This change:

- Extends `TopKThresholdTracker` to carry `nulls_direction` and an optional
  `Collator`, using a `compareFields` method with proper NULL and
  collation-aware semantics matching the real sort path.

- Adds nullable and collation execution paths to `FunctionTopKFilter`:
  the fast vectorized `less`/`greater` path is kept for non-nullable
  non-collation types; a per-element `compareAt` path handles Nullable
  columns with correct `nulls_direction`; a `compareAtWithCollation` path
  handles collation-aware string ordering.

- Sets `useDefaultImplementationForNulls` to false so that NULL inputs
  produce correct boolean filter results instead of NULL (which PREWHERE
  interprets as false).

- Removes the `isValueRepresentedByNumber() || isNullable()` gate in
  `optimizeTopK`, passing full sort semantics (`nulls_direction`, collator)
  from the `SortColumnDescription` to the threshold tracker.

Closes ClickHouse#99024

Made-with: Cursor
The skip-index top-k path (getTopKMarks / MinMaxGranuleItem::operator<)
ranks granules using raw Field comparison, which does not respect
`nulls_direction` or collation semantics. This could cause incorrect
mark skipping for ORDER BY ... NULLS FIRST/LAST or COLLATE queries.

Restrict skip-index top-k to non-nullable numeric types without
collation, where raw Field ordering matches ORDER BY semantics. The
dynamic filtering path (use_top_k_dynamic_filtering) remains fully
available for all types since it uses the corrected comparators in
`TopKThresholdTracker` and `FunctionTopKFilter`.

Add test 04034_top_k_skip_index_nullable_guard covering:
- EXPLAIN verification that skip-index top-k does NOT activate for
  Nullable(UInt32), String with COLLATE, or Nullable(String) with
  COLLATE
- EXPLAIN verification that skip-index top-k still activates for plain
  UInt32
- Dynamic filtering correctness for all the above types with various
  ASC/DESC/NULLS FIRST orderings

Made-with: Cursor
When only `use_skip_indexes_for_top_k` is enabled but the sort column
type is nullable, non-numeric, or uses a collator, `tryOptimizeTopK`
would still call `setTopKColumn` which unnecessarily disables the
query-condition cache.

Add type eligibility checks in `tryOptimizeTopK` (mirroring the guard
in `ReadFromMergeTree::buildIndexes`) so that `setTopKColumn` is only
called when at least one optimization path (skip-index or dynamic
filtering) will actually activate. This avoids performance side effects
from cache disablement on ineligible queries.

Made-with: Cursor
The skip-index top-k path (MinMaxGranuleItem::operator<, getTopKMarks)
uses raw Field comparison that does not respect nulls_direction or
collation. Add a TODO to generalize the comparison so the type
restriction can be lifted.

Made-with: Cursor
Add SELECT 'label' lines before each test query so the reference files
are self-documenting and easier to review.

Made-with: Cursor
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Mar 9, 2026

Workflow [PR], commit [af9af49]

Summary:

@clickhouse-gh clickhouse-gh bot added the pr-performance Pull request with some performance improvements label Mar 9, 2026
The fast test build does not include ICU, so `COLLATE 'en'`
queries fail. Mark both top-k tests as `no-fasttest`.

Made-with: Cursor
…filter

The vectorized path incorrectly used strict `less`/`greater` instead of
`lessOrEquals`/`greaterOrEquals` as on master. The nullable and collation
paths had the same issue with `< 0` instead of `<= 0`. Rows equal to the
threshold must pass the filter to ensure correctness of ORDER BY ... LIMIT.

Also fix `isValueInsideThreshold` to treat equal values as inside the
threshold (matching master's semantics).

Made-with: Cursor
Instead of per-element virtual `compareAt` calls, unwrap the
ColumnNullable, run the vectorized `lessOrEquals`/`greaterOrEquals`
on the nested column, then patch NULL positions with a precomputed
constant derived from direction and nulls_direction.

Made-with: Cursor
- Remove the dead else branch in the collation path (compareAt without
  collation was unreachable since the method is only called when
  collator is set).
- Rename executeWithColumnCompare to executeWithCollation to reflect
  its actual purpose.
- Remove the use_general_comparison field; check collator directly.

Made-with: Cursor
…ldTracker

Replace the three separate fields (direction, nulls_direction, collator)
in `TopKThresholdTracker` with a single `SortColumnDescription`, reducing
redundancy while keeping the public getter API unchanged.

In `FunctionTopKFilter`, replace the complex `executeNullable` path (manual
nullable unwrapping + null patching) and separate `executeWithCollation`
path with a single `executeGeneral` that uses `compareColumn` for
CRTP-devirtualized column-level comparison (non-collation) and per-element
`compareAtWithCollation` for the collation case.

Made-with: Cursor
@murphy-4o
Copy link
Copy Markdown
Member Author

@cursor review

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

@murphy-4o murphy-4o marked this pull request as ready for review March 10, 2026 12:10
@shankar-iyer shankar-iyer self-assigned this Mar 16, 2026
@rschu1ze
Copy link
Copy Markdown
Member

@murphy-4o Please include some a performance improvement measurement in the PR message.

@shankar-iyer
Copy link
Copy Markdown
Member

@murphy-4o Please include some a performance improvement measurement in the PR message.

Yes!

We could add a performance test but the hits table only has a URL column of String type. We needed some other columns like email_id, api_path etc. The optimization will be useful when the ORDER BY ... LIMIT n is on a String column and there are other heavy predicates (on other String/JSON columns). Please check the effect on query latency with -

SELECT URL, EventTime FROM hits ORDER BY URL settings use_top_k_dynamic_filtering = 1

SELECT URL, EventTime FROM hits ORDER BY URL WHERE URL LIKE '%yandex%' settings use_top_k_dynamic_filtering = 1

SELECT URL, EventTime FROM hits ORDER BY URL WHERE URL LIKE '%google%' settings use_top_k_dynamic_filtering = 1

SELECT URL, EventTime FROM hits ORDER BY URL WHERE URL LIKE '%xyzxyz%' settings use_top_k_dynamic_filtering = 1

The github_events table could be tried to see improvements in queries like

SELECT ... FROM github_events WHERE body ILIKE '%hack%' OR body ILIKE '%unsafe%' ORDER BY actor_login LIMIT 100 settings use_top_k_dynamic_filtering = 1

Again, we need many rows to pass the WHERE clause, else theuse_top_k_dynamic_filtering = 1could be a overhead. Ref : #99537

- Remove `skip_index->index.granularity != 1` guard: skip indexes with
  granularity > 1 are now supported (ref ClickHouse#93545).
- Trim test tags to just `no-fasttest`; 10000 rows is fine for sanitizers.

Made-with: Cursor
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Mar 18, 2026

LLVM Coverage Report

Metric Baseline Current Δ
Lines 83.70% 83.70% +0.00%
Functions 23.90% 23.90% +0.00%
Branches 76.30% 76.20% -0.10%

PR changed lines: PR changed-lines coverage: 81.67% (147/180, 1 noise lines excluded)
Diff coverage report
Uncovered code

@murphy-4o
Copy link
Copy Markdown
Member Author

@murphy-4o Please include some a performance improvement measurement in the PR message.

Yes!

We could add a performance test but the hits table only has a URL column of String type. We needed some other columns like email_id, api_path etc. The optimization will be useful when the ORDER BY ... LIMIT n is on a String column and there are other heavy predicates (on other String/JSON columns). Please check the effect on query latency with -

SELECT URL, EventTime FROM hits ORDER BY URL settings use_top_k_dynamic_filtering = 1

SELECT URL, EventTime FROM hits ORDER BY URL WHERE URL LIKE '%yandex%' settings use_top_k_dynamic_filtering = 1

SELECT URL, EventTime FROM hits ORDER BY URL WHERE URL LIKE '%google%' settings use_top_k_dynamic_filtering = 1

SELECT URL, EventTime FROM hits ORDER BY URL WHERE URL LIKE '%xyzxyz%' settings use_top_k_dynamic_filtering = 1

The github_events table could be tried to see improvements in queries like

SELECT ... FROM github_events WHERE body ILIKE '%hack%' OR body ILIKE '%unsafe%' ORDER BY actor_login LIMIT 100 settings use_top_k_dynamic_filtering = 1

Again, we need many rows to pass the WHERE clause, else theuse_top_k_dynamic_filtering = 1could be a overhead. Ref : #99537

@rschu1ze @shankar-iyer attached performance result

@murphy-4o murphy-4o changed the title add support for non-numeric & nullable types for FunctionTopKFilter Support non-numeric & nullable types for FunctionTopKFilter Mar 19, 2026
@shankar-iyer
Copy link
Copy Markdown
Member

@murphy-4o Please include some a performance improvement measurement in the PR message.

Yes!
We could add a performance test but the hits table only has a URL column of String type. We needed some other columns like email_id, api_path etc. The optimization will be useful when the ORDER BY ... LIMIT n is on a String column and there are other heavy predicates (on other String/JSON columns). Please check the effect on query latency with -

SELECT URL, EventTime FROM hits ORDER BY URL settings use_top_k_dynamic_filtering = 1

SELECT URL, EventTime FROM hits ORDER BY URL WHERE URL LIKE '%yandex%' settings use_top_k_dynamic_filtering = 1

SELECT URL, EventTime FROM hits ORDER BY URL WHERE URL LIKE '%google%' settings use_top_k_dynamic_filtering = 1

SELECT URL, EventTime FROM hits ORDER BY URL WHERE URL LIKE '%xyzxyz%' settings use_top_k_dynamic_filtering = 1

The github_events table could be tried to see improvements in queries like
SELECT ... FROM github_events WHERE body ILIKE '%hack%' OR body ILIKE '%unsafe%' ORDER BY actor_login LIMIT 100 settings use_top_k_dynamic_filtering = 1
Again, we need many rows to pass the WHERE clause, else theuse_top_k_dynamic_filtering = 1could be a overhead. Ref : #99537

@rschu1ze @shankar-iyer attached performance result

Thanks, nice results!

@murphy-4o murphy-4o added this pull request to the merge queue Mar 20, 2026
Merged via the queue into ClickHouse:master with commit b949c22 Mar 20, 2026
951 of 958 checks passed
@murphy-4o murphy-4o deleted the murphy_issue_99024 branch March 20, 2026 05:32
@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Mar 20, 2026
alexey-milovidov added a commit that referenced this pull request Mar 20, 2026
When the sort column in `ORDER BY ... LIMIT` matches the first column of
the storage's sorting key, the read-in-order optimization reads data in
sorted order and relies on early pipeline cancellation via `LIMIT` to
minimize `rows_read`.

The TopK dynamic filtering (prewhere `__topKFilter`) is counterproductive
in this case: once the threshold is established after the first few rows,
the prewhere rejects all subsequent rows (they are beyond the threshold
in sorted order), preventing early cancellation and causing a full table
scan.

This was exposed by PR #99033 which removed the type restriction that
previously prevented TopK optimization for String columns.

https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=b949c229302bf2d8153647c38452ddffbb04cb66&name_0=MasterCI&name_1=Stateless%20tests%20%28amd_asan_ubsan%2C%20distributed%20plan%2C%20parallel%2C%202%2F2%29

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-performance Pull request with some performance improvements pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

FunctionTopKFilter only implements numeric, non-nullable threshold semantics

4 participants