Support non-numeric & nullable types for FunctionTopKFilter#99033
Support non-numeric & nullable types for FunctionTopKFilter#99033murphy-4o merged 13 commits intoClickHouse:masterfrom
Conversation
…ation types `FunctionTopKFilter` and `TopKThresholdTracker` previously only supported numeric non-nullable types, using simplified direction-only comparisons (`less`/`greater`). This limited `use_top_k_dynamic_filtering` to a small subset of ORDER BY ... LIMIT queries. This change: - Extends `TopKThresholdTracker` to carry `nulls_direction` and an optional `Collator`, using a `compareFields` method with proper NULL and collation-aware semantics matching the real sort path. - Adds nullable and collation execution paths to `FunctionTopKFilter`: the fast vectorized `less`/`greater` path is kept for non-nullable non-collation types; a per-element `compareAt` path handles Nullable columns with correct `nulls_direction`; a `compareAtWithCollation` path handles collation-aware string ordering. - Sets `useDefaultImplementationForNulls` to false so that NULL inputs produce correct boolean filter results instead of NULL (which PREWHERE interprets as false). - Removes the `isValueRepresentedByNumber() || isNullable()` gate in `optimizeTopK`, passing full sort semantics (`nulls_direction`, collator) from the `SortColumnDescription` to the threshold tracker. Closes ClickHouse#99024 Made-with: Cursor
The skip-index top-k path (getTopKMarks / MinMaxGranuleItem::operator<) ranks granules using raw Field comparison, which does not respect `nulls_direction` or collation semantics. This could cause incorrect mark skipping for ORDER BY ... NULLS FIRST/LAST or COLLATE queries. Restrict skip-index top-k to non-nullable numeric types without collation, where raw Field ordering matches ORDER BY semantics. The dynamic filtering path (use_top_k_dynamic_filtering) remains fully available for all types since it uses the corrected comparators in `TopKThresholdTracker` and `FunctionTopKFilter`. Add test 04034_top_k_skip_index_nullable_guard covering: - EXPLAIN verification that skip-index top-k does NOT activate for Nullable(UInt32), String with COLLATE, or Nullable(String) with COLLATE - EXPLAIN verification that skip-index top-k still activates for plain UInt32 - Dynamic filtering correctness for all the above types with various ASC/DESC/NULLS FIRST orderings Made-with: Cursor
When only `use_skip_indexes_for_top_k` is enabled but the sort column type is nullable, non-numeric, or uses a collator, `tryOptimizeTopK` would still call `setTopKColumn` which unnecessarily disables the query-condition cache. Add type eligibility checks in `tryOptimizeTopK` (mirroring the guard in `ReadFromMergeTree::buildIndexes`) so that `setTopKColumn` is only called when at least one optimization path (skip-index or dynamic filtering) will actually activate. This avoids performance side effects from cache disablement on ineligible queries. Made-with: Cursor
The skip-index top-k path (MinMaxGranuleItem::operator<, getTopKMarks) uses raw Field comparison that does not respect nulls_direction or collation. Add a TODO to generalize the comparison so the type restriction can be lifted. Made-with: Cursor
Add SELECT 'label' lines before each test query so the reference files are self-documenting and easier to review. Made-with: Cursor
The fast test build does not include ICU, so `COLLATE 'en'` queries fail. Mark both top-k tests as `no-fasttest`. Made-with: Cursor
…filter The vectorized path incorrectly used strict `less`/`greater` instead of `lessOrEquals`/`greaterOrEquals` as on master. The nullable and collation paths had the same issue with `< 0` instead of `<= 0`. Rows equal to the threshold must pass the filter to ensure correctness of ORDER BY ... LIMIT. Also fix `isValueInsideThreshold` to treat equal values as inside the threshold (matching master's semantics). Made-with: Cursor
Instead of per-element virtual `compareAt` calls, unwrap the ColumnNullable, run the vectorized `lessOrEquals`/`greaterOrEquals` on the nested column, then patch NULL positions with a precomputed constant derived from direction and nulls_direction. Made-with: Cursor
- Remove the dead else branch in the collation path (compareAt without collation was unreachable since the method is only called when collator is set). - Rename executeWithColumnCompare to executeWithCollation to reflect its actual purpose. - Remove the use_general_comparison field; check collator directly. Made-with: Cursor
…ldTracker Replace the three separate fields (direction, nulls_direction, collator) in `TopKThresholdTracker` with a single `SortColumnDescription`, reducing redundancy while keeping the public getter API unchanged. In `FunctionTopKFilter`, replace the complex `executeNullable` path (manual nullable unwrapping + null patching) and separate `executeWithCollation` path with a single `executeGeneral` that uses `compareColumn` for CRTP-devirtualized column-level comparison (non-collation) and per-element `compareAtWithCollation` for the collation case. Made-with: Cursor
|
@cursor review |
|
@murphy-4o Please include some a performance improvement measurement in the PR message. |
tests/queries/0_stateless/04033_top_k_dynamic_filter_nullable_string.sql
Outdated
Show resolved
Hide resolved
Yes! We could add a performance test but the The
Again, we need many rows to pass the |
- Remove `skip_index->index.granularity != 1` guard: skip indexes with granularity > 1 are now supported (ref ClickHouse#93545). - Trim test tags to just `no-fasttest`; 10000 rows is fine for sanitizers. Made-with: Cursor
LLVM Coverage Report
PR changed lines: PR changed-lines coverage: 81.67% (147/180, 1 noise lines excluded) |
@rschu1ze @shankar-iyer attached performance result |
Thanks, nice results! |
b949c22
When the sort column in `ORDER BY ... LIMIT` matches the first column of the storage's sorting key, the read-in-order optimization reads data in sorted order and relies on early pipeline cancellation via `LIMIT` to minimize `rows_read`. The TopK dynamic filtering (prewhere `__topKFilter`) is counterproductive in this case: once the threshold is established after the first few rows, the prewhere rejects all subsequent rows (they are beyond the threshold in sorted order), preventing early cancellation and causing a full table scan. This was exposed by PR #99033 which removed the type restriction that previously prevented TopK optimization for String columns. https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=b949c229302bf2d8153647c38452ddffbb04cb66&name_0=MasterCI&name_1=Stateless%20tests%20%28amd_asan_ubsan%2C%20distributed%20plan%2C%20parallel%2C%202%2F2%29 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes #99024
The
ORDER BY ... LIMITtop-k dynamic filtering optimization (use_top_k_dynamic_filtering) was previously restricted to non-nullable numeric types. This PR generalizesFunctionTopKFilterandTopKThresholdTrackerto support:Nullable(T)with correctNULLS FIRST/NULLS LASTsemanticsStringandNullable(String)COLLATE-aware string orderingTODO
The skip-index top-k path (
use_skip_indexes_for_top_k) remains restricted to non-nullable numeric types because itsMinMaxGranuleItem::operator</getTopKMarksuse rawFieldcomparison that does not respectnulls_directionor collation (left as a TODO).Perf Result
hits_nullable_full(39M rows,MergeTree ORDER BY CounterID)SELECT col FROM hits_nullable_full ORDER BY col LIMIT 100Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Generalize
ORDER BY ... LIMITtop-k dynamic filtering to support Nullable, String, and COLLATE types.Documentation entry for user-facing changes