Optimize `ORDER BY ... LIMIT N` queries by skip index and dynamic threshold filter #89835
Conversation
Workflow [PR], commit [5f997d2] Summary: ❌
This sounds awesome, thank you! I wonder if there's a slightly slower, but more general, version of (2) that works with predicates, by roughly ranking the ranges to read by their sort-column values. For example, most tables are partitioned by time, and within a partition, recent parts are more likely to have recent data, so a query ordering by time descending could read the most recent parts first. Maybe it will cause more random I/O, but I guess you're overall saving I/O by skipping low-sorted granules, and with a LIMIT, you're able to force the dynamic top-N threshold to fill with the top values first, so you can skip more granules overall. Would this be possible? How does it sound?
---
An even easier alternative to my proposal might be to do some internal "request chunking", like what was recently introduced in ClickStack, where the query is broken up into ranges over the sort key.
---
@EmeraldShift Thank you for the inputs and sorry for the late reply!
This sounds cool, and if we can find a good threshold after reading/sorting just 5%-10% of the data, that would really let the query skip reading a vast number of granules. Let us try to firm up some logic for this in the PR, like you mentioned in your next comment. Maybe we add some specific logic for enqueueing read ranges when the sort column correlates with part order (e.g. a time column).
Let me check this.
---
Some interesting runs with the optimizations:
- Select top 10 with URL LIKE '%google%' - good impact
- Select top 1000 with URL LIKE '%google%' - not so good
- Select top 1000 with URL LIKE '%yandex%' - this is where the optimizations make an impact!
- Select top 10000 with URL LIKE '%yandex%' - good impact even with LIMIT 10000
- Just to complete: select top 10 with URL LIKE '%yandex%'

I am experimenting a bit more.
---
Great results! 😁 Quick question: does this optimization also cover the case in #75098? If so, this probably closes it!
(and #65990)
Pull Request Overview
This PR implements optimizations for ORDER BY ... LIMIT N queries (top-N queries) by leveraging skip indexes and dynamic threshold filtering to reduce the number of rows processed. The implementation introduces three main optimization strategies: (1) using minmax skip indexes to pre-select qualifying granules when no predicates exist, (2) dynamic threshold filtering during query execution using minmax indexes, and (3) PREWHERE-based filtering using dynamically computed thresholds.
Key changes include:
- New `TopNThresholdTracker` component for thread-safe threshold value sharing across query execution components
- Extension of minmax index functionality to support bulk granule operations and top-N filtering
- Query plan optimization to detect and configure top-N scenarios
- New internal function `__topNFilter` for PREWHERE-based filtering
- Two new settings: `use_skip_indexes_for_top_n` and `use_top_n_dynamic_filtering`
Reviewed Changes
Copilot reviewed 28 out of 28 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| src/Processors/TopNThresholdTracker.h | New thread-safe component for tracking and sharing top-N threshold values across optimization components |
| src/Functions/FunctionTopNFilter.cpp | New internal function for dynamic PREWHERE filtering based on evolving top-N thresholds |
| src/Storages/MergeTree/MergeTreeIndexMinMax.{h,cpp} | Extended to support bulk granule reading and top-N mark selection |
| src/Storages/MergeTree/MergeTreeIndexReadResultPool.{h,cpp} | Modified to support minmax index for top-N filtering and threshold tracking |
| src/Storages/MergeTree/MergeTreeReaderIndex.cpp | Enhanced mark skipping logic to use dynamic threshold filtering |
| src/Storages/MergeTree/MergeTreeDataSelectExecutor.{h,cpp} | Added top-N filtering using minmax indexes during part selection |
| src/Processors/Transforms/PartialSortingTransform.{h,cpp} | Integrated threshold tracker to publish top-N values during sorting |
| src/Processors/Transforms/MergeSortingTransform.{h,cpp} | Added threshold tracking and adaptive remerge logic for top-N optimization |
| src/Processors/QueryPlan/SortingStep.{h,cpp} | Plumbed threshold tracker through sorting pipeline |
| src/Processors/QueryPlan/ReadFromMergeTree.{h,cpp} | Added top-N filter info and skip index availability checking |
| src/Processors/QueryPlan/Optimizations/optimizeTopN.cpp | New optimization pass to detect and configure top-N query patterns |
| src/Processors/QueryPlan/Optimizations/QueryPlanOptimizationSettings.{h,cpp} | Added settings for controlling top-N optimizations |
| src/Processors/QueryPlan/Optimizations/Optimizations.h | Registered new top-N optimization in the optimization list |
| src/Processors/QueryPlan/Optimizations/optimizeTree.cpp | Propagated new settings through optimization passes |
| src/Core/Settings.cpp | Added documentation for new top-N optimization settings |
| src/Core/SettingsChangesHistory.cpp | Recorded new settings in version history |
| src/Functions/CMakeLists.txt | Added FunctionTopNFilter to build |
| tests/queries/0_stateless/03711_top_n_by_skip_index.{sql,reference} | Test cases for top-N optimization with skip indexes |
| tests/performance/top_n_optimization_hits.xml | Performance benchmark for top-N optimization |
```diff
   && remerge_is_useful
   && max_bytes_before_remerge
-  && sum_bytes_in_blocks > max_bytes_before_remerge)
+  && sum_bytes_in_blocks > max_bytes_before_remerge) || (threshold_tracker && (sum_rows_in_blocks > limit * 1.5)))
```
Magic number: The hardcoded value 1.5 should be extracted as a named constant (e.g., threshold_tracker_remerge_factor) to improve code readability and maintainability.
I need to check the implementation of this.
---
I'm curious about what makes the difference between these cases.
---
Another random thought for the multi-threaded case: for example, with three threads each solving a partial top-N, each thread knows its local threshold and the number of rows it holds at or above it. Given this data, we know there are at least 300 rows with N >= 123, and 250 + 300 = 550 rows with N >= 120, so if the query had e.g. LIMIT 550 or less, we could already derive a safe global threshold. In general, I suppose we can scan the threads in threshold order (in this case, descending), accumulating row counts until the limit is covered.
---
Some more results from further testing:
---
@EmeraldShift By any chance, were you testing without commit 9cd57bf? Is the skip index defined on the sorting column?
---
The testing is on that commit. Here's the table schema:
---
I'll re-test my query on the `granularity = 1` commit later today.
---
I can no longer reproduce the segmentation fault on the latest commit 😁
Thanks!
---
Would it be possible to also leverage the primary ORDER BY key if the data is [partially] sorted by the query's ORDER BY? Some random thoughts:
This feature looks great as-is, and is already a huge improvement! I'm mostly asking if this sounds like reasonable follow-on work to continue optimizing this shape of query, which is extremely common in my logging use case! 😁
01e3689
```diff
      UInt64 limit_,
-     bool skip_partial_sort = false);
+     bool skip_partial_sort = false,
+     TopKThresholdTrackerPtr threshold_tracker = nullptr);
```
@shankar-iyer Passing TopKThresholdTrackerPtr breaks the ability to serialize/deserialize SortingStep and makes this feature incompatible with distributed queries. A plan step must be a "description" of what is done that can easily be passed over the network to a different node; only the initializePipeline/updatePipeline methods should convert this description into the runtime structures that implement the actual processing logic.
Maybe you can use an approach similar to what is done for join runtime filters: they are "registered" in the query context under unique names and looked up by that name at runtime: https://github.com/ClickHouse/ClickHouse/blob/master/src/Interpreters/Context.h#L283
@davenger Thanks for the review and spotting this! Let me check and I will update accordingly.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Optimize `ORDER BY ... LIMIT N` queries by using a skip index and a dynamic threshold filter to significantly reduce the number of rows processed.

Details

Implements optimizations for top-N query processing. All have a common theme:

1. If a `minmax` index exists on the ORDER BY column, use the `minmax` index to shortlist the top qualifying granules, e.g. only 400K rows processed out of 100M for `SELECT * FROM logs WHERE hasToken(message, 'OOM') ORDER BY timestamp LIMIT 10`. This item needs some work and discussions.
2. Dynamic top-N threshold filtering using the `minmax` skip index. As the query executes and the top-N values keep evolving, the current Nth value (the threshold) can be used to look up the `minmax` index and skip entirely any granules that cannot improve the top-N result set.
3. If a `minmax` skip index is not present, still use the top-N threshold to execute a `PREWHERE` on the sorting column. This optimization will be useful if there are other costly predicates in the query, since evaluating the dynamic PREWHERE and rejecting rows first could turn out cheaper. This was initially proposed in WIP some perf optimizations #81944.

I still need to fix some issues; high-level and low-level feedback welcome!