Analyzer: filtering by virtual columns for StorageS3#56668
Conversation
This is an automated comment for commit 31a6c7c with a description of existing statuses. It's updated for the latest CI run.
Force-pushed from 3256a00 to bb8c30f
novikd left a comment:
The code around filter DAG creation seems to be incorrect.
src/Storages/StorageS3.cpp (outdated)

    header = renameColumnsInBlock(header, column_rename);
    auto filter_dag = buildActionsDAGFromExpressionNode(
        query_node->getWhere(), header.getColumnsWithTypeAndName(), query_info.planner_context);
This looks very suspicious. You should not extract the filter from the query tree. The filter must be stored in SelectQueryInfo::filter_actions_dag.
PartitionPruner uses SelectQueryInfo::filter_actions_dag.
I updated the code to use filters from SourceStepWithFilter, as was done in #56391.
Force-pushed from bb8c30f to ab99267
Force-pushed from ab99267 to 95e9a27
I don't understand these changes. New filtering with
        size_t num_streams)
    {
        auto query_configuration = updateConfigurationAndGetCopy(local_context);
        auto read_from_format_info = prepareReadingFromFormat(column_names, storage_snapshot, supportsSubsetOfColumns(local_context), virtual_columns);
This read_from_format_info is used only for getting source_header, and then we call prepareReadingFromFormat again in ReadFromStorageS3Step::initializePipeline. Why not just pass read_from_format_info as an argument to the ReadFromStorageS3Step constructor, so we don't do the same work twice?
:) create table test (number UInt64) engine=S3('http://localhost:11111/test/data*.jsonl');
:) select count(), _file from test where _file='data1.jsonl' group by _file settings allow_experimental_analyzer = 1;
┌─count()─┬─_file───────┐
│ 10 │ data1.jsonl │
└─────────┴─────────────┘
[avogar-dev] 2023.12.29 13:49:56 [ 0 ] EngineFileLikeReadFiles: 2 (increment)

So we still read 2 files instead of 1. But when we do the same using the table function, it works:

:) select count(), _file from s3('http://localhost:11111/test/data*.jsonl') where _file = 'data1.jsonl' group by _file settings allow_experimental_analyzer=1
┌─count()─┬─_file───────┐
│ 10 │ data1.jsonl │
└─────────┴─────────────┘
[avogar-dev] 2023.12.29 13:51:49 [ 0 ] EngineFileLikeReadFiles: 1 (increment)

What is the difference between reading from the storage and from the table function in the analyzer?
And should we do something similar with other engines, like
And also new functions. If you don't have time to do it, tell me, I will do it.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
...