WHERE condtions optimization leveraging projection feature#57216
WHERE condtions optimization leveraging projection feature#57216suzyw-w wants to merge 37 commits intoClickHouse:masterfrom
Conversation
|
Should be slightly better But there is also potential problem, with case when subquery return big set of keys, index analysis in that case can be slower than reading whole table and filtering. |
|
This is an automated comment for commit 020fb95 with description of existing statuses. It's updated for the latest CI running ❌ Click here to open a full report in a separate page
Successful checks
|
|
I also have some plan to implement secondary indices on top of normal projections. It might rely on ColumnLazy. |
|
TODO list:
|
|
Ideally, this "rewrite" should happen on level of part. So each part, will be filtered by list of PK from it's own projection. |
This is in our consideration for the next stage. |
So far we have been limited our usecase of MergeTree Enginee while using projection. Is this a known issue for CollapsingMergeTree and ReplacingMergeTree only? Meaning does MergeTree enginee has the same problem? |
It's known issue that currently projections are not working correctly (but, it's mostly harm aggregation projections) for ReplacingMergeTree and CollapsingMergeTree. Fix is to recalculate projections from result of merge, instead of source projections. For your case, if we want to use |
It sounds like there's still some uncertainty about |
|
@alexey-milovidov Hi, can someone please take a look at this pr? |
|
@UnamedRus @amosbird Hi, as you guys are the expert on this, do we have any more concerns of this change? |
|
@alexey-milovidov @KochetovNicolai @davenger Hi, Can we please have someone review the pr? |
|
@UnamedRus @amosbird @alexey-milovidov @KochetovNicolai @davenger Hi, as this pr has been sitting around for a while, can we please have someone start reviewing this? |
|
I'm not a part of ClickHouse Inc team. (so I can only speak for myself) Overall Idea is good and needs to be implemented. But, i see few concern points:
|
Thank you for your response. I do need some backup here to let clickhouse team start looking at this pr. |
|
@alexey-milovidov @KochetovNicolai @davenger Can we please have someone start review this pr? |
|
@SuzyWangIBMer This pull request didn't pass tests. |
|
@alexey-milovidov Conflicts are resolved now. Current failures seems non-relevant to my change. |
|
Closes as new implementation moving to Analyzer/Passes is done. New PR opened #63207 |
This is a proposal to optimize select query by leveraging projection feature.
Since PROJECTION can effectively create a new primary key for the table, it can increase the searching speed if using properly.
This implementation inserts a new subquery to the original where condition, in order to use the projection feature in more general cases.
For example,
select * from table where secondary_key='-42';will now become
select * from table where primary_key in (select primary_key from test_a where secondary_key='-42') and secondary_key='-42';This implementation is tested and proved increasing query execution speed vastly.
Thanks to @UnamedRus 's suggestion, indexHint() is also added to further optimize the query. Now the query is going to be re-write to
select * from table where indexHint(primary_key in (select primary_key from test_a where secondary_key='-42')) and secondary_key='-42';It is proved that this will also optimize query execution speed. Tests results are attached below.
Pseudo Code
Restrictions
=function.For example,
can be re-written to
However, query like the following will not be recognized, and will not be re-written.
Test result
CREATE TABLE test_a ( `src` String, `dst` String, `other_cols` String, PROJECTION p1 ( SELECT src, dst ORDER BY dst ) ) ENGINE = MergeTree ORDER BY src/* Query using primary key (fastest) */
/* Query does not using projection */
/* Optimized non-projection query now can leveraging projection feature */
palmtops1.fyre.ibm.com :) select * from test_a where src in (select src from test_a where dst='-42') and dst='-42'; SELECT * FROM test_a WHERE (src IN ( SELECT src FROM test_a WHERE dst = '-42' )) AND (dst = '-42') Query id: 68ca1b70-940f-41d2-983b-4df7c06df5ae ┌─src─┬─dst─┬─other_cols───┐ │ 42 │ -42 │ other_col 42 │ └─────┴─────┴──────────────┘ 1 row in set. Elapsed: 0.101 sec. Processed 32.77 thousand rows, 1.18 MB (325.38 thousand rows/s., 11.75 MB/s.) Peak memory usage: 275.28 KiB/* Optimized query with indexHint */
palmtops1.fyre.ibm.com :) select * from test_a where indexHint(src in (select src from test_a where dst='-42')) and dst='-42'; SELECT * FROM test_a WHERE indexHint(src IN ( SELECT src FROM test_a WHERE dst = '-42' )) AND (dst = '-42') Query id: 365d4fae-93dc-48e5-b0c0-42dc2ba9904e ┌─src─┬─dst─┬─other_cols───┐ │ 42 │ -42 │ other_col 42 │ └─────┴─────┴──────────────┘ 1 row in set. Elapsed: 0.094 sec. Processed 32.77 thousand rows, 1.18 MB (350.31 thousand rows/s., 12.65 MB/s.) Peak memory usage: 223.92 KiB.Query using primary key (fastest) : 0.042 sec, Peak memory usage: 44.66 KiB
Query does not using projection : 2.177 sec, Peak memory usage: 1021.77 KiB.
Optimized non-projection query : 0.101 sec, Peak memory usage: 275.28 KiB
Optimized query with indexHint : 0.094 sec, Peak memory usage: 223.92 KiB.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Documentation entry for user-facing changes