
Filtering by skip indexes on data reading #81526

Merged
CurtizJ merged 29 commits into ClickHouse:master from amosbird:apply-skip-index-on-reading
Aug 27, 2025

Conversation

@amosbird
Collaborator

@amosbird amosbird commented Jun 9, 2025

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Support filtering data parts using skip indexes during reading to reduce unnecessary index reads. Controlled by the new setting use_skip_indexes_on_data_read (disabled by default). This addresses #75774. This includes some common groundwork shared with #81021.

An illustrative example of the optimization effect:

create table test (s String, index s_idx s type bloom_filter(0.0001) granularity 1) engine MergeTree order by () settings index_granularity = 1024;

insert into test select if(number % 1024 == 0, 'needle', randomPrintableASCII(64)) from numbers_mt(10000000);

-- 39ms
select * from test where s = 'needle' limit 1 settings max_threads = 1, use_skip_indexes_on_data_read = 0, use_query_condition_cache = 0, send_logs_level = 'test';

-- 16ms
select * from test where s = 'needle' limit 1 settings max_threads = 1, use_skip_indexes_on_data_read = 1, use_query_condition_cache = 0, send_logs_level = 'test';

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@clickhouse-gh
Contributor

clickhouse-gh bot commented Jun 9, 2025

Workflow [PR], commit [d5b5044]

@clickhouse-gh clickhouse-gh bot added the pr-improvement Pull request with some product improvements label Jun 9, 2025
@amosbird amosbird force-pushed the apply-skip-index-on-reading branch 3 times, most recently from 60b4e9f to 77abd63 Compare June 10, 2025 05:03
@devcrafter devcrafter self-assigned this Jun 11, 2025
@amosbird amosbird force-pushed the apply-skip-index-on-reading branch from 020c7b8 to dcf2543 Compare June 12, 2025 03:51
@amosbird amosbird mentioned this pull request Jun 21, 2025
1 task
@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label Jun 22, 2025
@clickhouse-gh
Contributor

clickhouse-gh bot commented Jun 22, 2025

Workflow [PR], commit [3d3791c]


@amosbird amosbird force-pushed the apply-skip-index-on-reading branch from 62dd6db to 1aabfed Compare June 26, 2025 01:53
@amosbird amosbird force-pushed the apply-skip-index-on-reading branch from b23ad59 to c5936d7 Compare July 15, 2025 16:39
@amosbird
Collaborator Author

I've examined all 31 test failures and found them all incompatible with the use_skip_indexes_on_data_read setting for one of the following reasons:

  1. The number of rows to read is limited by max_rows_to_read.
  2. The index usage is ineffective: EXPLAIN with indexes = 1 shows additional granules being read.
  3. query_plan_join_swap_table cannot swap tables because the plan stage does not utilize indexes.
  4. force_data_skipping_indices fails since the plan stage omits index usage.

I've disabled use_skip_indexes_on_data_read for them explicitly.

Member

@CurtizJ CurtizJ left a comment


In general the idea is good.

Comment on lines +315 to +329
MergeTreeIndexReadResultPtr index_read_result;
if (merge_tree_index_build_context)
{
    const auto & part_ranges = merge_tree_index_build_context->read_ranges.at(task->getInfo().part_index_in_query);
    auto & remaining_marks = merge_tree_index_build_context->part_remaining_marks.at(task->getInfo().part_index_in_query).value;
    index_read_result = merge_tree_index_build_context->index_reader->getOrBuildIndexReadResult(part_ranges);

    /// Atomically subtract the number of marks this task will read from the total remaining marks. If the
    /// remaining marks after subtraction reach zero, this is the last task for the part, and we can trigger
    /// cleanup of any per-part cached resources (e.g., skip index read result).
    size_t task_marks = task->getNumMarksToRead();
    bool part_last_task = remaining_marks.fetch_sub(task_marks, std::memory_order_acq_rel) == task_marks;
    if (part_last_task)
        merge_tree_index_build_context->index_reader->clear(task->getInfo().data_part);
}
Member

@CurtizJ CurtizJ Aug 25, 2025


Now you read the index for all ranges of the part here. Is it possible to read only the ranges from the current task?

Also, now this line is incorrect because of this:

bool part_last_task = remaining_marks.fetch_sub(task_marks, std::memory_order_acq_rel) == task_marks;

Collaborator Author


Is it possible to read only the ranges from the current task?

This isn't really feasible. First, an index may span multiple granules, which makes alignment difficult. There can also be multiple indexes with different granule boundaries, which adds even more complexity. In addition, reading at the granule level is not I/O-friendly; e.g., with minmax indexes the I/O granularity becomes too small, which hurts performance. It could also introduce a lot of unnecessary random I/O.

Also, now this line is incorrect because of this:

I don't quite understand what you mean by "incorrect." The purpose of this flag is simply to allow the last reader of this part to clean up the shared index structure that was built.

Member


First, an index may span multiple granules, which makes alignment difficult. There can also be multiple indexes with different granule boundaries, which adds even more complexity

Yes, that makes sense.

It could also introduce a lot of unnecessary random IO.

I thought that since selected ranges are not contiguous, they would require a random read anyway. But indeed, for minmax and some other indexes, almost all index data may fit into the read buffer and won't cause a random read even for non-contiguous ranges.

I don't quite understand what you mean by "incorrect."

Sorry, I misunderstood one thing.

size_t offset,
Columns & res_columns) override;

bool canReadIncompleteGranules() const override { return main_reader->canReadIncompleteGranules(); }
Member


Is it important for the index reader? I'd expect canReadIncompleteGranules to return false because this method is not applicable to the index reader. Also, returning false would allow removing the dependency on main_reader.

Collaborator Author


It is important, because this behavior only takes effect in the main reader. When the index reader is used, it effectively becomes the main reader, so the flag still matters, especially for skewed wide parts.

Member


Ok, let's keep it.

@CurtizJ
Member

CurtizJ commented Aug 26, 2025

It looks like the failure in Fast Test is related. Please fix, and I'll merge the PR.

@CurtizJ CurtizJ added this pull request to the merge queue Aug 27, 2025
Merged via the queue into ClickHouse:master with commit d530e1c Aug 27, 2025
122 checks passed
@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Aug 27, 2025
@EmeraldShift
Contributor

Thank you for this feature, I'm excited to see it merged! Is there any chance it can be included in 25.8?

@alexey-milovidov
Member

+1, thanks for the feature and the review, my dreams come true!

@EmeraldShift, it is just a few days late for 25.8, so it will be in 25.9.

@rschu1ze
Member

Issue for broken EXPLAIN indexes = 1: #88467

@shankar-iyer
Member

@amosbird @CurtizJ @nickitat Greetings! The previous attempt to enable use_skip_indexes_on_data_read by default was reverted: #88638. That attempt led to the identification and fixes listed here: #88504 (comment). Right now, we have 2 issues pending before this setting can be enabled by default:

  1. Cardinality estimation in join order planning - I am working on that

  2. How do we handle parallel replicas? Reference: link. Based on the current implementation, use_skip_indexes_on_data_read=1 will read entire index granules of a skip index on each replica where a part's sub-ranges are processed (see the comment above in #81526). As explained in that comment, this aspect is not an overhead for minmax indexes, as they are very small. The vector index does not need use_skip_indexes_on_data_read=1 because static index analysis is best for it. I understand that the text index has its own equivalent of index reading at runtime. That leaves the set index and the bloom filter index. Please let me know your opinion! Is it okay to read/cache the entire index on multiple replicas even if they are processing partial ranges?

There is another feature (use_skip_indexes_for_top_k: #89835) that relies on use_skip_indexes_on_data_read, so I need to enable both settings by default, planned for 26.1. That feature uses only the minmax index.

@alexey-milovidov
Member

Is it okay to read/cache the entire index on multiple replicas even if they are processing partial ranges?

Yes, it is ok.
