Remove sorting from query plan stream properties#68410
Remove sorting from query plan stream properties#68410KochetovNicolai merged 33 commits intomasterfrom
Conversation
|
This is an automated comment for commit dfe8eb0 with description of existing statuses. It's updated for the latest CI running ❌ Click here to open a full report in a separate page
Successful checks
|
| Sorting (Stream): __table1.a ASC, __table1.b ASC | ||
| Sorting (Stream): __table1.a ASC, __table1.b ASC | ||
| Sorting (Stream): a ASC, b ASC | ||
| Sorting: __table1.a ASC |
There was a problem hiding this comment.
I think this is correct because for a query select distinct b, a from distinct_in_order_explain order by a you don't get (a, b) order after sorting, even if read-in-order with (a, b).
This is because the sorting step does not know anything about b and will use a stable sort (e.g. will prefer the left most input vs input with smallest b) which will break order.
| Sorting (None) | ||
| Sorting (Sorting for ORDER BY) | ||
| Sorting (Global): plus(a, 1) ASC | ||
| Sorting (Chunk): a ASC |
There was a problem hiding this comment.
Imho, Chunk sorting does not give any helpful information, so I did remove it:
- First of all, I can't guarantee that reading from MergeTree always keeps chunks sorted by PK. Even if so, it is very fragile, e.g. somebody may want to read from different parts in one read task (afiak this is already happening)
- So far,
Chunkmay be used for Sorting (or, maybe,Distinctoptimization, but I think that's wrong), but in both cases we do in-order optimization and requestStreamsorting.
| Expression (Projection) | ||
| Distinct | ||
| Sorting (Sorting for ORDER BY) | ||
| Distinct (Preliminary DISTINCT) |
There was a problem hiding this comment.
The reason of the change is that Preliminary DISTINCT was pushed down over UNION.
I am not sure whether we should remove it in this case, but this was done by remove_redundant_distinct optimization. (why?)
0f50d2a to
a6d53c2
Compare
feb8219 to
1d3027b
Compare
devcrafter
left a comment
There was a problem hiding this comment.
Please see the comments. It's not everything, but can't anymore
| } | ||
|
|
||
| /// merge sorted | ||
| if (input_sort_mode == DataStream::SortScope::Stream && input_sort_desc.hasPrefix(result_description)) |
There was a problem hiding this comment.
This case may not be covered with new implementation
| settings.max_block_size, | ||
| settings.aggregation_in_order_max_block_bytes, | ||
| std::move(group_by_sort_description), | ||
| SortDescription{}, |
There was a problem hiding this comment.
We can delete this parameter from MergingAggregatedStep
|
|
||
| DataStream ITransformingStep::createOutputStream( | ||
| const DataStream & input_stream, | ||
| [[maybe_unused]] const DataStream & input_stream, |
There was a problem hiding this comment.
Should we delete these parameters at all?
There was a problem hiding this comment.
I would like to remove it later, in the next PR.
There are too many steps affected.
There was a problem hiding this comment.
Then probably it makes sense to add TODO comment
| const AggregateDescriptions & aggregates, | ||
| bool should_produce_results_in_order_of_bucket_number, | ||
| SortDescription group_by_sort_description) | ||
| [[maybe_unused]] SortDescription group_by_sort_description) |
| @@ -199,8 +199,6 @@ void optimizeTreeSecondPass(const QueryPlanOptimizationSettings & optimization_s | |||
| } | |||
| } | |||
|
|
|||
There was a problem hiding this comment.
It was moved to applyOrder.cpp, the branch for MergingAggregatedStep
|
|
||
| output_stream.sort_description = std::move(output_sort_description); | ||
| output_stream.sort_scope = DataStream::SortScope::Stream; | ||
| // output_stream.sort_description = std::move(output_sort_description); |
| } | ||
|
|
||
| void ReadFromRemote::enforceSorting(SortDescription output_sort_description) | ||
| void ReadFromRemote::enforceSorting([[maybe_unused]] SortDescription output_sort_description) |
There was a problem hiding this comment.
Remove param, and rename probably to enableMemoryBoundMerging()
| } | ||
|
|
||
| void ReadFromParallelRemoteReplicasStep::enforceSorting(SortDescription output_sort_description) | ||
| void ReadFromParallelRemoteReplicasStep::enforceSorting([[maybe_unused]] SortDescription output_sort_description) |
| bool use_buffering = order_info->limit == 0; | ||
| sorting->convertToFinishSorting(order_info->sort_description_for_merging, use_buffering); | ||
| updateStepsDataStreams(steps_to_update); | ||
| // updateStepsDataStreams(steps_to_update); |
| AggregationInputOrder buildInputOrderInfo(AggregatingStep & aggregating, QueryPlan::Node & node) | ||
| { | ||
| QueryPlan::Node * reading_node = findReadingStep(node, backward_path); | ||
| QueryPlan::Node * reading_node = findReadingStep(node, false); |
| const Names & columns_, | ||
| bool pre_distinct_, /// If is enabled, execute distinct for separate streams. Otherwise, merge streams. | ||
| bool optimize_distinct_in_order_); | ||
| /// If is enabled, execute distinct for separate streams. Otherwise, merge streams. |
There was a problem hiding this comment.
Probably
| /// If is enabled, execute distinct for separate streams. Otherwise, merge streams. | |
| /// If is enabled, execute distinct for separate streams, otherwise for merged streams. |
|
|
||
| DataStream ITransformingStep::createOutputStream( | ||
| const DataStream & input_stream, | ||
| [[maybe_unused]] const DataStream & input_stream, |
There was a problem hiding this comment.
Then probably it makes sense to add TODO comment
| global_ctx->deduplicate_by_columns, | ||
| false /*pre_distinct*/, | ||
| true /*optimize_distinct_in_order TODO: looks like it should be enabled*/); | ||
| false /*pre_distinct*/); |
There was a problem hiding this comment.
In this case, we do not apply in order optimization anymore - which would lead to increased memory consumption during merges, at least
There was a problem hiding this comment.
Oh, that's a good catch!
Looks like we did not have a test for this :(
* [GLUTEN-1632][CH]Daily Update Clickhouse Version (20240928) * Fix build due to ClickHouse/ClickHouse#65625 * Fix build due to ClickHouse/ClickHouse#68410 --------- Co-authored-by: kyligence-git <gluten@kyligence.io> Co-authored-by: Chang Chen <baibaichen@gmail.com>
|
This PR fixed |
|
Ok, it is only with UPD I believe this assumption is not valid (or, at least, very fragile). We can easily have a situation when the reading task reads from 2 parts and forms a single chunk from it. |
) * [GLUTEN-1632][CH]Daily Update Clickhouse Version (20240928) * Fix build due to ClickHouse/ClickHouse#65625 * Fix build due to ClickHouse/ClickHouse#68410 --------- Co-authored-by: kyligence-git <gluten@kyligence.io> Co-authored-by: Chang Chen <baibaichen@gmail.com>
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Because it is too complicated to support.
CI Settings (Only check the boxes if you know what you are doing):