Support non-numeric & nullable types for FunctionTopKFilter by murphy-4o · Pull Request #99033 · ClickHouse/ClickHouse

murphy-4o · 2026-03-09T06:43:29Z

The ORDER BY ... LIMIT top-k dynamic filtering optimization (use_top_k_dynamic_filtering) was previously restricted to non-nullable numeric types. This PR generalizes FunctionTopKFilter and TopKThresholdTracker to support:

Nullable(T) with correct NULLS FIRST / NULLS LAST semantics
String and Nullable(String)
COLLATE-aware string ordering

TODO
The skip-index top-k path (use_skip_indexes_for_top_k) remains restricted to non-nullable numeric types because its MinMaxGranuleItem::operator< / getTopKMarks use raw Field comparison that does not respect nulls_direction or collation (left as a TODO).

Perf Result

Dataset: hits_nullable_full (39M rows, MergeTree ORDER BY CounterID)
Query pattern: SELECT col FROM hits_nullable_full ORDER BY col LIMIT 100

#	Type	OFF (s)	ON (s)	Speedup
1	UInt64 WatchID	0.009	0.009	1.00x
2	String URL	0.082	0.069	1.19x
3	String Title	0.074	0.072	1.03x
4	Nullable(UInt64) WatchID_N	0.035	0.014	2.50x
5	Nullable(String) URL_N	0.239	0.076	3.14x
6	Nullable(String) Title_N	0.203	0.079	2.57x

Changelog category (leave one):

Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Generalize ORDER BY ... LIMIT top-k dynamic filtering to support Nullable, String, and COLLATE types.

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

…ation types `FunctionTopKFilter` and `TopKThresholdTracker` previously only supported numeric non-nullable types, using simplified direction-only comparisons (`less`/`greater`). This limited `use_top_k_dynamic_filtering` to a small subset of ORDER BY ... LIMIT queries. This change: - Extends `TopKThresholdTracker` to carry `nulls_direction` and an optional `Collator`, using a `compareFields` method with proper NULL and collation-aware semantics matching the real sort path. - Adds nullable and collation execution paths to `FunctionTopKFilter`: the fast vectorized `less`/`greater` path is kept for non-nullable non-collation types; a per-element `compareAt` path handles Nullable columns with correct `nulls_direction`; a `compareAtWithCollation` path handles collation-aware string ordering. - Sets `useDefaultImplementationForNulls` to false so that NULL inputs produce correct boolean filter results instead of NULL (which PREWHERE interprets as false). - Removes the `isValueRepresentedByNumber() || isNullable()` gate in `optimizeTopK`, passing full sort semantics (`nulls_direction`, collator) from the `SortColumnDescription` to the threshold tracker. Closes ClickHouse#99024 Made-with: Cursor

The skip-index top-k path (getTopKMarks / MinMaxGranuleItem::operator<) ranks granules using raw Field comparison, which does not respect `nulls_direction` or collation semantics. This could cause incorrect mark skipping for ORDER BY ... NULLS FIRST/LAST or COLLATE queries. Restrict skip-index top-k to non-nullable numeric types without collation, where raw Field ordering matches ORDER BY semantics. The dynamic filtering path (use_top_k_dynamic_filtering) remains fully available for all types since it uses the corrected comparators in `TopKThresholdTracker` and `FunctionTopKFilter`. Add test 04034_top_k_skip_index_nullable_guard covering: - EXPLAIN verification that skip-index top-k does NOT activate for Nullable(UInt32), String with COLLATE, or Nullable(String) with COLLATE - EXPLAIN verification that skip-index top-k still activates for plain UInt32 - Dynamic filtering correctness for all the above types with various ASC/DESC/NULLS FIRST orderings Made-with: Cursor

When only `use_skip_indexes_for_top_k` is enabled but the sort column type is nullable, non-numeric, or uses a collator, `tryOptimizeTopK` would still call `setTopKColumn` which unnecessarily disables the query-condition cache. Add type eligibility checks in `tryOptimizeTopK` (mirroring the guard in `ReadFromMergeTree::buildIndexes`) so that `setTopKColumn` is only called when at least one optimization path (skip-index or dynamic filtering) will actually activate. This avoids performance side effects from cache disablement on ineligible queries. Made-with: Cursor

The skip-index top-k path (MinMaxGranuleItem::operator<, getTopKMarks) uses raw Field comparison that does not respect nulls_direction or collation. Add a TODO to generalize the comparison so the type restriction can be lifted. Made-with: Cursor

Add SELECT 'label' lines before each test query so the reference files are self-documenting and easier to review. Made-with: Cursor

clickhouse-gh · 2026-03-09T06:44:55Z

Workflow [PR], commit [af9af49]

Summary: ✅

The fast test build does not include ICU, so `COLLATE 'en'` queries fail. Mark both top-k tests as `no-fasttest`. Made-with: Cursor

…filter The vectorized path incorrectly used strict `less`/`greater` instead of `lessOrEquals`/`greaterOrEquals` as on master. The nullable and collation paths had the same issue with `< 0` instead of `<= 0`. Rows equal to the threshold must pass the filter to ensure correctness of ORDER BY ... LIMIT. Also fix `isValueInsideThreshold` to treat equal values as inside the threshold (matching master's semantics). Made-with: Cursor

Instead of per-element virtual `compareAt` calls, unwrap the ColumnNullable, run the vectorized `lessOrEquals`/`greaterOrEquals` on the nested column, then patch NULL positions with a precomputed constant derived from direction and nulls_direction. Made-with: Cursor

- Remove the dead else branch in the collation path (compareAt without collation was unreachable since the method is only called when collator is set). - Rename executeWithColumnCompare to executeWithCollation to reflect its actual purpose. - Remove the use_general_comparison field; check collator directly. Made-with: Cursor

…ldTracker Replace the three separate fields (direction, nulls_direction, collator) in `TopKThresholdTracker` with a single `SortColumnDescription`, reducing redundancy while keeping the public getter API unchanged. In `FunctionTopKFilter`, replace the complex `executeNullable` path (manual nullable unwrapping + null patching) and separate `executeWithCollation` path with a single `executeGeneral` that uses `compareColumn` for CRTP-devirtualized column-level comparison (non-collation) and per-element `compareAtWithCollation` for the collation case. Made-with: Cursor

murphy-4o · 2026-03-10T11:50:09Z

@cursor review

cursor

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

rschu1ze · 2026-03-16T10:18:38Z

@murphy-4o Please include some a performance improvement measurement in the PR message.

src/Processors/QueryPlan/ReadFromMergeTree.cpp

tests/queries/0_stateless/04033_top_k_dynamic_filter_nullable_string.sql

shankar-iyer · 2026-03-18T05:55:36Z

@murphy-4o Please include some a performance improvement measurement in the PR message.

Yes!

We could add a performance test but the hits table only has a URL column of String type. We needed some other columns like email_id, api_path etc. The optimization will be useful when the ORDER BY ... LIMIT n is on a String column and there are other heavy predicates (on other String/JSON columns). Please check the effect on query latency with -

SELECT URL, EventTime FROM hits ORDER BY URL settings use_top_k_dynamic_filtering = 1

SELECT URL, EventTime FROM hits ORDER BY URL WHERE URL LIKE '%yandex%' settings use_top_k_dynamic_filtering = 1

SELECT URL, EventTime FROM hits ORDER BY URL WHERE URL LIKE '%google%' settings use_top_k_dynamic_filtering = 1

SELECT URL, EventTime FROM hits ORDER BY URL WHERE URL LIKE '%xyzxyz%' settings use_top_k_dynamic_filtering = 1

The github_events table could be tried to see improvements in queries like

SELECT ... FROM github_events WHERE body ILIKE '%hack%' OR body ILIKE '%unsafe%' ORDER BY actor_login LIMIT 100 settings use_top_k_dynamic_filtering = 1

Again, we need many rows to pass the WHERE clause, else theuse_top_k_dynamic_filtering = 1could be a overhead. Ref : #99537

- Remove `skip_index->index.granularity != 1` guard: skip indexes with granularity > 1 are now supported (ref ClickHouse#93545). - Trim test tags to just `no-fasttest`; 10000 rows is fine for sanitizers. Made-with: Cursor

clickhouse-gh · 2026-03-18T16:44:26Z

LLVM Coverage Report

Metric	Baseline	Current	Δ
Lines	83.70%	83.70%	+0.00%
Functions	23.90%	23.90%	+0.00%
Branches	76.30%	76.20%	-0.10%

PR changed lines: PR changed-lines coverage: 81.67% (147/180, 1 noise lines excluded)
Diff coverage report
Uncovered code

murphy-4o · 2026-03-19T01:53:12Z

@murphy-4o Please include some a performance improvement measurement in the PR message.

Yes!

We could add a performance test but the hits table only has a URL column of String type. We needed some other columns like email_id, api_path etc. The optimization will be useful when the ORDER BY ... LIMIT n is on a String column and there are other heavy predicates (on other String/JSON columns). Please check the effect on query latency with -
SELECT URL, EventTime FROM hits ORDER BY URL settings use_top_k_dynamic_filtering = 1

SELECT URL, EventTime FROM hits ORDER BY URL WHERE URL LIKE '%yandex%' settings use_top_k_dynamic_filtering = 1

SELECT URL, EventTime FROM hits ORDER BY URL WHERE URL LIKE '%google%' settings use_top_k_dynamic_filtering = 1

SELECT URL, EventTime FROM hits ORDER BY URL WHERE URL LIKE '%xyzxyz%' settings use_top_k_dynamic_filtering = 1
The github_events table could be tried to see improvements in queries like

SELECT ... FROM github_events WHERE body ILIKE '%hack%' OR body ILIKE '%unsafe%' ORDER BY actor_login LIMIT 100 settings use_top_k_dynamic_filtering = 1

Again, we need many rows to pass the WHERE clause, else theuse_top_k_dynamic_filtering = 1could be a overhead. Ref : #99537

@rschu1ze @shankar-iyer attached performance result

shankar-iyer · 2026-03-19T05:56:05Z

@murphy-4o Please include some a performance improvement measurement in the PR message.

Yes!
We could add a performance test but the hits table only has a URL column of String type. We needed some other columns like email_id, api_path etc. The optimization will be useful when the ORDER BY ... LIMIT n is on a String column and there are other heavy predicates (on other String/JSON columns). Please check the effect on query latency with -
SELECT URL, EventTime FROM hits ORDER BY URL settings use_top_k_dynamic_filtering = 1

SELECT URL, EventTime FROM hits ORDER BY URL WHERE URL LIKE '%yandex%' settings use_top_k_dynamic_filtering = 1

SELECT URL, EventTime FROM hits ORDER BY URL WHERE URL LIKE '%google%' settings use_top_k_dynamic_filtering = 1

SELECT URL, EventTime FROM hits ORDER BY URL WHERE URL LIKE '%xyzxyz%' settings use_top_k_dynamic_filtering = 1
The github_events table could be tried to see improvements in queries like
SELECT ... FROM github_events WHERE body ILIKE '%hack%' OR body ILIKE '%unsafe%' ORDER BY actor_login LIMIT 100 settings use_top_k_dynamic_filtering = 1
Again, we need many rows to pass the WHERE clause, else theuse_top_k_dynamic_filtering = 1could be a overhead. Ref : #99537
@rschu1ze @shankar-iyer attached performance result

Thanks, nice results!

When the sort column in `ORDER BY ... LIMIT` matches the first column of the storage's sorting key, the read-in-order optimization reads data in sorted order and relies on early pipeline cancellation via `LIMIT` to minimize `rows_read`. The TopK dynamic filtering (prewhere `__topKFilter`) is counterproductive in this case: once the threshold is established after the first few rows, the prewhere rejects all subsequent rows (they are beyond the threshold in sorted order), preventing early cancellation and causing a full table scan. This was exposed by PR #99033 which removed the type restriction that previously prevented TopK optimization for String columns. https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=b949c229302bf2d8153647c38452ddffbb04cb66&name_0=MasterCI&name_1=Stateless%20tests%20%28amd_asan_ubsan%2C%20distributed%20plan%2C%20parallel%2C%202%2F2%29 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

murphy-4o added 5 commits March 9, 2026 04:58

Refactor top-k test references to use SELECT labels for readability

b60c08a

Add SELECT 'label' lines before each test query so the reference files are self-documenting and easier to review. Made-with: Cursor

clickhouse-gh bot added the pr-performance Pull request with some performance improvements label Mar 9, 2026

murphy-4o added 6 commits March 9, 2026 08:35

Add no-fasttest tag to tests using COLLATE

f2579c0

The fast test build does not include ICU, so `COLLATE 'en'` queries fail. Mark both top-k tests as `no-fasttest`. Made-with: Cursor

Merge branch 'master' into murphy_issue_99024

4992c53

cursor bot reviewed Mar 10, 2026

View reviewed changes

murphy-4o marked this pull request as ready for review March 10, 2026 12:10

Merge branch 'master' into murphy_issue_99024

75d964a

shankar-iyer self-assigned this Mar 16, 2026

shankar-iyer reviewed Mar 18, 2026

View reviewed changes

src/Processors/QueryPlan/ReadFromMergeTree.cpp Outdated Show resolved Hide resolved

shankar-iyer reviewed Mar 18, 2026

View reviewed changes

tests/queries/0_stateless/04033_top_k_dynamic_filter_nullable_string.sql Outdated Show resolved Hide resolved

Address review comments on PR ClickHouse#99033

af9af49

- Remove `skip_index->index.granularity != 1` guard: skip indexes with granularity > 1 are now supported (ref ClickHouse#93545). - Trim test tags to just `no-fasttest`; 10000 rows is fine for sanitizers. Made-with: Cursor

murphy-4o changed the title ~~add support for non-numeric & nullable types for FunctionTopKFilter~~ Support non-numeric & nullable types for FunctionTopKFilter Mar 19, 2026

shankar-iyer approved these changes Mar 20, 2026

View reviewed changes

murphy-4o added this pull request to the merge queue Mar 20, 2026

Merged via the queue into ClickHouse:master with commit b949c22 Mar 20, 2026
951 of 958 checks passed

murphy-4o deleted the murphy_issue_99024 branch March 20, 2026 05:32

robot-ch-test-poll1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Mar 20, 2026

alexey-milovidov mentioned this pull request Mar 20, 2026

Disable TopK dynamic filtering when read-in-order applies #100235

Merged

1 task

murphy-4o mentioned this pull request Mar 21, 2026

Implement fused TopNAggregation operator for GROUP BY ... ORDER BY aggregate LIMIT K #98607

Open

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support non-numeric & nullable types for FunctionTopKFilter#99033

Support non-numeric & nullable types for FunctionTopKFilter#99033
murphy-4o merged 13 commits intoClickHouse:masterfrom
murphy-4o:murphy_issue_99024

murphy-4o commented Mar 9, 2026 •

edited

Loading

Uh oh!

clickhouse-gh bot commented Mar 9, 2026 •

edited

Loading

Uh oh!

murphy-4o commented Mar 10, 2026

Uh oh!

cursor bot left a comment

Uh oh!

rschu1ze commented Mar 16, 2026

Uh oh!

Uh oh!

Uh oh!

shankar-iyer commented Mar 18, 2026

Uh oh!

clickhouse-gh bot commented Mar 18, 2026

Uh oh!

murphy-4o commented Mar 19, 2026

Uh oh!

shankar-iyer commented Mar 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

murphy-4o commented Mar 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Documentation entry for user-facing changes

Uh oh!

clickhouse-gh bot commented Mar 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

murphy-4o commented Mar 10, 2026

Uh oh!

cursor bot left a comment

Choose a reason for hiding this comment

Uh oh!

rschu1ze commented Mar 16, 2026

Uh oh!

Uh oh!

Uh oh!

shankar-iyer commented Mar 18, 2026

Uh oh!

clickhouse-gh bot commented Mar 18, 2026

LLVM Coverage Report

Uh oh!

murphy-4o commented Mar 19, 2026

Uh oh!

shankar-iyer commented Mar 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

murphy-4o commented Mar 9, 2026 •

edited

Loading

clickhouse-gh bot commented Mar 9, 2026 •

edited

Loading