Improvements skip index final exact mode (follow up to PR #78350) by shankar-iyer · Pull Request #79661 · ClickHouse/ClickHouse

shankar-iyer · 2025-04-28T11:40:09Z

Background in PR #78350. This PR implements couple of optimizations to reduce query execution time for typical primary key layout in parts (merged or unmerged) -

From the ranges selected by the skip index, determine lowest bound and upper bound of the primary key. Now for each part, consider the part's ranges for intersection only if selected lower/upper bound are within the part's primary key lower/upper bound.
If a part's ranges have to be considered for intersection (after above check), use std::upper_bound() with the selected lower/upper bound to identify the appropriate sub-range of the part that will be further evaluated for intersection.

Both the above steps will significantly reduce the number of memory copies and reduce the number of ranges in the sort step.

Changelog category (leave one):

Not for changelog (changelog entry is not required)

…ection

clickhouse-gh · 2025-04-28T11:40:36Z

Workflow [PR], commit [50f7207]

shankar-iyer · 2025-05-09T05:41:01Z

Where this optimization helps -

1. Assume a table #1 with many parts and all are merged parts with no PK overlap and the PK ranges are -
        part #1 -> 1..100, part #2 -> 101..200, part #3 -> 201..300.
 2. Assume a table #2 with many parts and there is unmerged update activity and the parts look like this -
        part #1 -> 1..100, part #2 -> 60..200, part #3 -> 40..300., part #4 -> 250..400

As an example, assume key range 30-60 is shortlisted by secondary index.

In table #1, the PR ensures that ranges from part #2, part #3 and beyond are just eliminated from
intersection evaluation with a couple of checks

In table #2, the PR ensures that only a small subset of ranges from part #2 and part #3 
encompassing PK range 30-60 are included for intersection evaluation and same applies
for other parts where intersection is feasible.

Now extrapolate above for 1000's of parts and millions of ranges. The first cut implementation considered all ranges from all parts for intersection evaluation. This follow up PR keeps in mind typical real-world table layout and improves latency significantly by cutting down memory copies, sorting time, and range intersection evaluation.

shankar-iyer · 2025-05-12T07:25:21Z

No code changes, just refreshed to master.

shankar-iyer · 2025-05-12T11:12:29Z

Too many failures, hopefully nothing related to PR because it is self-contained. I will investigate shortly!

shankar-iyer · 2025-05-12T16:07:52Z

None of the failures are related to the PR's code location, I am re-running failed jobs.

shankar-iyer · 2025-05-13T00:27:46Z

Only failure after rerun is in AST fuzzer (amd_ubsan) - flaky, tracked by #79451

shankar-iyer · 2025-05-13T08:51:34Z

@nickitat Please review!

nickitat · 2025-05-27T10:50:19Z

src/Processors/QueryPlan/PartsSplitter.cpp

+        /// intersection boundary.
+        auto candidates_start = std::upper_bound(part_candidate_ranges.begin(), part_candidate_ranges.end(),
+                                                 selected_lower_bound);
+        if (unlikely(candidates_start == part_candidate_ranges.end()))


Suggested change

if (unlikely(candidates_start == part_candidate_ranges.end()))

if (candidates_start == part_candidate_ranges.end())

nickitat · 2025-05-27T11:00:19Z

src/Processors/QueryPlan/PartsSplitter.cpp

    }

+    Values selected_highest_value;
+    std::vector<std::set<size_t>> part_selected_ranges(ranges_in_data_parts.size(), std::set<size_t>());


As far as I see, we only use this DS to determine whether the particular granule was selected. Do we really have to store a boolean for each granule, or could we just do a similar binary search over selected ranges?

This DS will only grow in size as per the number of granules was selected by the skip index. Even if doing binary search on selected ranges, we have to somewhere expand the selected ranges into granules, else code becomes complex. My first version of code did not have this DS and I was doing the 2 for loops in a single loop and the code was very messy and difficult to comprehend. The DS explains the code well and should grow just be a small subset of the total granules in the table.

AFAIU, we would expand for only (candidates_end - candidates_start) granules. Maybe it is indeed not so important.
Then, I'd propose replacing set with vector + sort, at least to avoid the large number of small dynamic allocations that a set would make.

nickitat · 2025-05-27T11:12:02Z

src/Processors/QueryPlan/PartsSplitter.cpp

+
+    const PartsRangesIterator selected_lower_bound = selected_ranges[0];
+    /// highest value is important, range & part can be any
+    const PartsRangesIterator selected_upper_bound{selected_highest_value, false, MarkRange{0, 1}, 0,


Maybe it'd make sense to provide additional overloads to comparison functions to avoid such "hacks"

Done, reworked.

nickitat · 2025-05-27T11:15:31Z

src/Processors/QueryPlan/PartsSplitter.cpp

+                                                 selected_lower_bound);
+        if (unlikely(candidates_start == part_candidate_ranges.end()))
+            continue; /// no intersection possible in this part
+        auto candidates_end = std::upper_bound(part_candidate_ranges.begin(), part_candidate_ranges.end(),


IndexAccess has methods like findRightmostMarkLessThanValueInRange that seem to do something very similar - can we use them?

Thanks, done!

shankar-iyer · 2025-05-28T08:25:04Z

Clean CI Run!

shankar-iyer · 2025-05-29T03:57:42Z

Only failure in "[Stateless tests (tsan, s3 storage, 3/3)]" is data-race, tracked by #80866

shankar-iyer added 8 commits April 16, 2025 12:18

Optimize further - check a part's start & end key for possible inters…

8868d87

…ection

Improve comment

0f66ef1

upper_bound implementation checkpoint

afcbc48

Well tested and bencharmked

335fb15

Ready for review

8ee0527

Check bounds of part first

2236eab

Merge branch 'master' into improvements_skip_index_final_exact_mode

3a586b8

Ready for review

ae74597

clickhouse-gh bot added the pr-not-for-changelog This PR should not be mentioned in the changelog label Apr 28, 2025

shankar-iyer assigned nickitat Apr 28, 2025

Merge branch 'master' into improvements_skip_index_final_exact_mode

a2e6635

nickitat reviewed May 27, 2025

View reviewed changes

Code review comments

b3876e3

Replace set<> with vector<?

50f7207

nickitat approved these changes Jun 3, 2025

View reviewed changes

shankar-iyer enabled auto-merge June 3, 2025 15:50

shankar-iyer added this pull request to the merge queue Jun 3, 2025

Merged via the queue into ClickHouse:master with commit 741609f Jun 3, 2025
118 of 121 checks passed

shankar-iyer deleted the improvements_skip_index_final_exact_mode branch June 3, 2025 16:06

robot-ch-test-poll2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 3, 2025

shankar-iyer mentioned this pull request Sep 5, 2025

use_skip_indexes_if_final_exact_mode incorrectly applied for skip index over ORDER BY column FINAL #85897

Closed

	if (unlikely(candidates_start == part_candidate_ranges.end()))
	if (candidates_start == part_candidate_ranges.end())

Conversation

shankar-iyer commented Apr 28, 2025

Changelog category (leave one):

Uh oh!

clickhouse-gh bot commented Apr 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

shankar-iyer commented May 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

shankar-iyer commented May 12, 2025

Uh oh!

shankar-iyer commented May 12, 2025

Uh oh!

shankar-iyer commented May 12, 2025

Uh oh!

shankar-iyer commented May 13, 2025

Uh oh!

shankar-iyer commented May 13, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

shankar-iyer commented May 28, 2025

Uh oh!

shankar-iyer commented May 29, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

clickhouse-gh bot commented Apr 28, 2025 •

edited

Loading

shankar-iyer commented May 9, 2025 •

edited

Loading