Filtering by skip indexes on data reading#81526
Conversation
60b4e9f to
77abd63
Compare
020c7b8 to
dcf2543
Compare
62dd6db to
1aabfed
Compare
b23ad59 to
c5936d7
Compare
|
I've examined all 31 test failures and found them all incompatible with the
I've disabled |
| MergeTreeIndexReadResultPtr index_read_result; | ||
| if (merge_tree_index_build_context) | ||
| { | ||
| const auto & part_ranges = merge_tree_index_build_context->read_ranges.at(task->getInfo().part_index_in_query); | ||
| auto & remaining_marks = merge_tree_index_build_context->part_remaining_marks.at(task->getInfo().part_index_in_query).value; | ||
| index_read_result = merge_tree_index_build_context->index_reader->getOrBuildIndexReadResult(part_ranges); | ||
|
|
||
| /// Atomically subtract the number of marks this task will read from the total remaining marks. If the | ||
| /// remaining marks after subtraction reach zero, this is the last task for the part, and we can trigger | ||
| /// cleanup of any per-part cached resources (e.g., skip index read result). | ||
| size_t task_marks = task->getNumMarksToRead(); | ||
| bool part_last_task = remaining_marks.fetch_sub(task_marks, std::memory_order_acq_rel) == task_marks; | ||
| if (part_last_task) | ||
| merge_tree_index_build_context->index_reader->clear(task->getInfo().data_part); | ||
| } |
There was a problem hiding this comment.
Now you read the index for all ranges of the part here. Is it possible to read only the ranges from the current task?
Also, now this line is incorrect because of this:
bool part_last_task = remaining_marks.fetch_sub(task_marks, std::memory_order_acq_rel) == task_marks;
There was a problem hiding this comment.
Is it possible to read only the ranges from the current task?
This isn't really feasible. First, an index may span multiple granules, which makes alignment difficult. There can also be multiple indexes with different granule boundaries, which adds even more complexity. In addition, reading at the granule level is not IO-friendly, e.g, with minmax indexes the IO granularity becomes too small, which hurts performance. It could also introduce a lot of unnecessary random IO.
Also, now this line is incorrect because of this:
I don't quite understand what you mean by "incorrect." The purpose of this flag is simply to allow the last reader of this part to clean up the shared index structure that was built.
There was a problem hiding this comment.
First, an index may span multiple granules, which makes alignment difficult. There can also be multiple indexes with different granule boundaries, which adds even more complexity
Yes, that makes sense.
It could also introduce a lot of unnecessary random IO.
I thought that since selected ranges are not contiguous, they would require a random read anyway. But indeed, for minmax and some other indexes, almost all index data may fit into the read buffer and won't cause a random read even for non-contiguous ranges.
I don't quite understand what you mean by "incorrect."
Sorry, I misunderstood one thing.
| size_t offset, | ||
| Columns & res_columns) override; | ||
|
|
||
| bool canReadIncompleteGranules() const override { return main_reader->canReadIncompleteGranules(); } |
There was a problem hiding this comment.
Is it important for the index reader? I'd expect that canReadIncompleteGranules returns false because this method is not applicable to the index reader. Also returning false will allow to remove dependency from the main_reader.
There was a problem hiding this comment.
It is important, because this behavior only takes effect in the main reader. When the index reader is used, it effectively becomes the main reader, so the flag still matters, especially for skewed wide parts.
|
It looks like the failure in Fast Test is related. Please fix, and I'll merge the PR. |
|
Thank you for this feature, I'm excited to see it merged! Is there any chance it can be included in 25.8? |
|
+1, thanks for the feature and the review, my dreams come true! @EmeraldShift, it is just a few days late for 25.8, so it will be in 25.9. |
|
Issue for broken |
|
@amosbird @CurtizJ @nickitat Greetings! The previous attempt to default enable
There is another feature ( |
Yes, it is ok. |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Support filtering data parts using skip indexes during reading to reduce unnecessary index reads. Controlled by the new setting
use_skip_indexes_on_data_read(disabled by default). This addresses #75774. This includes some common groundwork shared with #81021.An illustrative example of the optimization effect:
Documentation entry for user-facing changes