[Improvement](topn) runtime prune for topn query #15558

Merged

Gabriel39 merged 27 commits into apache:master from xiaokang:topn_runtime_prune

Jan 5, 2023

Contributor

xiaokang commented Jan 3, 2023

Proposed changes

Issue Number: close #xxx

Problem summary

Describe your changes.

This PR optimizes top-N queries such as SELECT * FROM tableX ORDER BY columnA ASC/DESC LIMIT N.

The sort node uses the intermediate top-N result to dynamically generate a predicate on the ORDER BY column (columnA in the example query). The scan node then uses that predicate to filter segment files, pages, and rows, which can greatly reduce the amount of data read.
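The idea can be sketched in a few lines of C++. This is a minimal illustration under stated assumptions, not the code in this PR: TopNBound, BlockStats, can_skip_block and row_passes are hypothetical names, the sort column is a plain int, and only the ASC case is shown. The sort side keeps the N smallest values in a max-heap; once the heap is full, its top is a bound that the scan side can compare against min/max statistics and individual rows.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <queue>

// Hypothetical names throughout; this is not the Doris implementation.
// Tracks the N smallest values seen so far (ORDER BY col ASC LIMIT N)
// and exposes the current boundary for a "col <= bound" filter.
class TopNBound {
public:
    explicit TopNBound(std::size_t limit) : _limit(limit) {}

    // Feed one value from the sort side. Returns true if it entered the heap.
    bool offer(int value) {
        if (_limit == 0) return false;
        if (_heap.size() < _limit) {
            _heap.push(value);
            return true;
        }
        assert(!_heap.empty());  // holds because size == _limit > 0
        if (value < _heap.top()) {
            _heap.pop();
            _heap.push(value);
            return true;
        }
        return false;
    }

    // A bound exists only once the heap is full; before that nothing can be pruned.
    std::optional<int> bound() const {
        if (_limit == 0 || _heap.size() < _limit) return std::nullopt;
        return _heap.top();
    }

private:
    std::size_t _limit;
    std::priority_queue<int> _heap;  // max-heap: top() is the largest kept value
};

// Min/max statistics for a block of rows (a stand-in for segment or page statistics).
struct BlockStats {
    int min_value;
    int max_value;
};

// A block can be skipped when even its smallest value is above the bound.
bool can_skip_block(const BlockStats& stats, const std::optional<int>& bound) {
    return bound.has_value() && stats.min_value > *bound;
}

// Row-level check using the inclusive operator ("<=" for ASC).
bool row_passes(int value, const std::optional<int>& bound) {
    return !bound.has_value() || value <= *bound;
}
```

With a limit of 2 and the values 50, 10, 20 offered in that order, the bound tightens from 50 to 20, after which a block whose minimum is 21 can be skipped without being read.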

Checklist(Required)

Does it affect the original behavior:
- Yes
- No
- I don't know
Has unit tests been added:
- Yes
- No
- No Need
Has document been added or modified:
- Yes
- No
- No Need
Does it need to update dependencies:
- Yes
- No
Are there any changes that cannot be rolled back:
- Yes (If Yes, please explain WHY)
- No

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

github-actions bot added area/planner area/vectorization labels

github-actions bot reviewed

Contributor

github-actions bot left a comment

clang-tidy made some suggestions

be/src/runtime/runtime_predicate.cpp (outdated, resolved)

be/src/runtime/runtime_predicate.h (outdated, resolved)

Contributor

hello-stephen commented Jan 3, 2023

TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 35.31 seconds
load time: 480 seconds
storage size: 17122490937 Bytes
https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20230105101438_clickbench_pr_74161.html

kpfly commented Jan 3, 2023

need support on nereids optimizer

Gabriel39 reviewed

be/src/vec/exec/vsort_node.cpp (outdated, resolved)

be/src/vec/exec/vsort_node.cpp (outdated, resolved)

be/src/runtime/runtime_predicate.cpp (outdated, resolved)

be/src/runtime/runtime_predicate.h (outdated, resolved)

be/src/runtime/runtime_predicate.cpp Outdated

    
    TCondition condition;
    condition.__set_column_name(col_name);
    condition.__set_column_unique_id(_tablet_schema->column(col_name).unique_id());
    condition.__set_condition_op(is_reverse ? ">=" : "<=");

Contributor

Gabriel39 Jan 3, 2023

Should we use ">" and "<" instead?

Contributor Author

xiaokang Jan 3, 2023

We should include the boundary value here. If the query contains multiple sort columns and the first column has a single value, we should include this value.
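A small simulation makes the boundary argument concrete. It is a sketch under assumptions, not code from this PR: topn_with_filter and Row are hypothetical, rows are (a, b) pairs sorted by a then b, and the scan filters on the first column only, using the current N-th smallest a as the bound.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Row = std::pair<int, int>;  // (a, b), ordered by a, then b

// Simulates ORDER BY a, b LIMIT n where rows are pre-filtered on column a only.
// `inclusive` selects "a <= bound" (true) or "a < bound" (false).
std::vector<Row> topn_with_filter(const std::vector<Row>& input, std::size_t n,
                                  bool inclusive) {
    std::vector<Row> kept;  // current top-n candidates, kept sorted
    for (const Row& row : input) {
        if (n > 0 && kept.size() == n) {
            // Bound taken from the first sort column of the current n-th row.
            int bound = kept.back().first;
            bool pass = inclusive ? row.first <= bound : row.first < bound;
            if (!pass) continue;  // the row is pruned before it reaches the sorter
        }
        kept.insert(std::upper_bound(kept.begin(), kept.end(), row), row);
        if (kept.size() > n) kept.pop_back();
    }
    return kept;
}
```

For the input (1,9), (1,8), (1,3), (1,1) with n = 2, the inclusive filter returns (1,1), (1,3). The strict filter prunes every later row because 1 < 1 is false, and returns (1,8), (1,9), which is not the correct top 2.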

be/src/runtime/runtime_predicate.cpp (outdated, resolved)

be/src/runtime/runtime_predicate.cpp (resolved)

be/src/runtime/runtime_predicate.h

    
    std::shared_ptr<ColumnPredicate> get_predictate() {
        std::shared_lock<std::shared_mutex> rlock(_rwlock);
        return _predictate;

Contributor

Gabriel39 Jan 3, 2023

Better to use unique_ptr and return a pure pointer here

Contributor Author

xiaokang Jan 4, 2023

Yes, unique_ptr is more efficient, but a lifetime problem led me to choose shared_ptr.

In the update function, _predictate is reset whenever a new boundary value is found, so segment iterators could end up using a deleted pointer if a raw pointer were returned.
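The lifetime concern can be illustrated with a minimal sketch. The names BoundPredicate and RuntimeBoundHolder are hypothetical and the predicate is reduced to a single int bound; only the pattern of copying a shared_ptr under a shared lock is taken from the excerpt above.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <utility>

// Hypothetical stand-in for a column predicate: "value <= bound".
struct BoundPredicate {
    int bound;
    bool evaluate(int value) const { return value <= bound; }
};

class RuntimeBoundHolder {
public:
    // Reader side: copy the shared_ptr under a shared lock. The copy keeps the
    // predicate object alive even if update() replaces it right afterwards.
    std::shared_ptr<const BoundPredicate> get() const {
        std::shared_lock<std::shared_mutex> rlock(_rwlock);
        return _predicate;
    }

    // Writer side: build the new predicate outside the lock, then swap it in.
    // The old object is destroyed only when its last reader drops its copy.
    void update(int new_bound) {
        std::shared_ptr<const BoundPredicate> fresh =
                std::make_shared<BoundPredicate>(BoundPredicate{new_bound});
        std::unique_lock<std::shared_mutex> wlock(_rwlock);
        _predicate = std::move(fresh);
    }

private:
    mutable std::shared_mutex _rwlock;
    std::shared_ptr<const BoundPredicate> _predicate;  // null until the first update
};
```

A reader that called get() before an update keeps evaluating against the older, looser bound, which is safe for pruning. With a unique_ptr member and a raw pointer return, the same reader would be holding a dangling pointer after the reset.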

Contributor

github-actions bot commented Jan 3, 2023

clang-tidy review says "All clean, LGTM! 👍"

hf200012 added the dev/1.2.1 label

Contributor

github-actions bot commented Jan 4, 2023

clang-tidy review says "All clean, LGTM! 👍"

Gabriel39 reviewed

be/src/runtime/primitive_type.h (outdated, resolved)

Contributor

github-actions bot commented Jan 4, 2023

clang-tidy review says "All clean, LGTM! 👍"

Contributor

github-actions bot commented Jan 4, 2023

clang-tidy review says "All clean, LGTM! 👍"

xiaokang added 18 commits

January 5, 2023 14:42


          topn detail query optimization: runtime predicate for rows

bbf03eb


          add missing new files runtime_predicate.h/cpp

42f08e5


          topn detail query opt: runtime predicate for segments and pages

b4d43a6


          topn detail query opt: fe check query to enable opt

c7944a4


          do not use topn opt for order by string type

8370a7f


          fix nullpointer when check type

9d7e2ad


          support more types and using TypeIndex in runtime_predicate

206ec7f


          use HeapSorter when _use_topn_opt or no var length field

a2183fe


          remove debug log

9be53a1


          fix compile error

ef41c68


          move topn before sink in sort node to check topn field

0b8e6b9


          clang format

a0ff37b


          clang tidy

5da1e4f


          use getMaterializedOrderingExprs instead of getOrderingExprs

ecad665

to avoid mismatch query like ORDER BY col*10


          move topn opt check to sink function of SortNode

17e2e21


          change vector to single field for runtime predicate

49d81f6


          1. use function obj instead of switch case

a137066

2. add _init function for RuntimePredicate


          clang format

68b24bc

xiaokang added 4 commits

January 5, 2023 14:42


          clang format

9cc50eb


          only set topn opt for vec mode

183da6f


          minor changes for _runtime_state and old_top in VSortNode

200afce


          delete unused function get_primitive_type

ff48825

xiaokang force-pushed the topn_runtime_prune branch from 6cb8e88 to ff48825 Compare

January 5, 2023 06:42

Contributor

github-actions bot commented Jan 5, 2023

clang-tidy review says "All clean, LGTM! 👍"

Gabriel39 reviewed

be/src/runtime/runtime_predicate.h (outdated, resolved)

Gabriel39 reviewed

be/src/runtime/runtime_predicate.h (outdated, resolved)

Gabriel39 reviewed

be/src/runtime/runtime_predicate.cpp (outdated, resolved)

xiaokang and others added 3 commits

January 5, 2023 15:39


          Update be/src/runtime/runtime_predicate.h

c04082d

Co-authored-by: Gabriel <gabrielleebuaa@gmail.com>


          Update be/src/runtime/runtime_predicate.h

81ce5e6

Co-authored-by: Gabriel <gabrielleebuaa@gmail.com>


          Update be/src/runtime/runtime_predicate.cpp

1c71c4e

Co-authored-by: Gabriel <gabrielleebuaa@gmail.com>

Contributor

github-actions bot commented Jan 5, 2023

clang-tidy review says "All clean, LGTM! 👍"

github-actions bot reviewed

Contributor

github-actions bot left a comment

clang-tidy made some suggestions

be/src/runtime/runtime_predicate.cpp (resolved)

Contributor

github-actions bot commented Jan 5, 2023

clang-tidy review says "All clean, LGTM! 👍"

Gabriel39 approved these changes

Contributor

Gabriel39 left a comment

LGTM

github-actions bot added the approved label

Contributor

github-actions bot commented Jan 5, 2023

PR approved by at least one committer and no changes requested.

github-actions bot added the reviewed label

Contributor

github-actions bot commented Jan 5, 2023

PR approved by anyone and no changes requested.

Gabriel39 merged commit 9d1f02c into apache:master

luozenglin mentioned this pull request

[fix](predicate) fix be core dump caused by pushing down the double column predicate #15693

Merged

13 tasks

xiaokang mentioned this pull request

[testcase](topn)add test cases for nonkey topn query for each scalar type #15790

Merged

13 tasks

morningman pushed a commit that referenced this pull request


          [regression-test](topn)add test cases for nonkey topn query for each …

806cd9f

…scalar type (#15790)

related to #15558 #15693
1. dup key table with 17 scalar datatypes
2. unique key table with mow enabled
3. unique key table with mow disabled

morningman added dev/1.2.2 and removed dev/1.2.1 dev/1.2.2 labels

englefly mentioned this pull request

[enhancement](nereids)support topN opt in nereids #17741

Merged

5 tasks

luwei16 pushed a commit to luwei16/incubator-doris that referenced this pull request


          [pick](topn) Sync and pick topn from doris master (apache#1431)

652ccc7

Issue Number: close http://jira.selectdb-in.cc/browse/CORE-1462

Describe the overview of changes.

commit e1697741a82f875ca42b0d18caa7972eaa225bee
Author: Kang <kxiao.tiger@gmail.com>
Date:   Thu Jan 19 22:59:29 2023 +0800

    [opt](test) scalar_types_p0 use 100k lines dataset and scalar_types_p2 use 1000k (apache#16104)

commit 33a47e8d02644123ffd8c5c4353653c1c175e96a
Author: Kang <kxiao.tiger@gmail.com>
Date:   Wed Jan 18 14:17:24 2023 +0800

    [testcase](bitmap index)bitmap index testcase (apache#15975)

    * add bitmap index testcases for all scalar types

commit 260a631441834ca7e23da4b77c922eb818eddca7
Author: Kang <kxiao.tiger@gmail.com>
Date:   Mon Jan 16 16:49:59 2023 +0800

    [regression-test](topn)add test cases for nonkey topn query for each scalar type (apache#15790)

    related to apache#15558 apache#15693
    1. dup key table with 17 scalar datatypes
    2. unique key table with mow enabled
    3. unique key table with mow disabled

commit 81cea5219ae86df950f10aa123072df78c7cdf23
Author: Kang <kxiao.tiger@gmail.com>
Date:   Sun Feb 19 23:28:33 2023 +0800

    [bugfix](topn) fix topn read_orderby_key_columns nullptr (apache#16896)

    The SQL `SELECT nationkey FROM regression_test_query_p0_limit.tpch_tiny_nation ORDER BY nationkey DESC LIMIT 5`
    makes BE core dump because it dereferences a nullptr `read_orderby_key_columns` in `VCollectIterator::_topn_next`,
    triggered by skipping the `_colname_to_value_range` init in apache#16818.

    This PR makes two changes:
    1. avoid read_orderby_key_columns nullptr in TabletReader::_init_orderby_keys_param
    2. return error if read_orderby_key_columns is nullptr unexpected in VCollectIterator::_topn_next to avoid core dump

commit 2fee1d1d79942e49eddaafdc2b49e49b0651b109
Author: Kang <kxiao.tiger@gmail.com>
Date:   Fri Feb 10 12:56:33 2023 +0800

    [Improvement](topn) add limit threashold session variable and fuzzy for topn optimizations (apache#16514)

    1. add limit threshold for topn runtime pushdown and key topn optimization
    2. use unified session variable topn_opt_limit_threshold for all topn optimizations
    3. add fuzzy support for topn_opt_limit_threshold

commit 1696bed39129fcc891f32f64ff1fb43f9531fcd4
Author: Kang <kxiao.tiger@gmail.com>
Date:   Thu Feb 2 09:13:32 2023 +0800

    [bugfix](topn) fix topn runtime predicate getting value bug for decimal type (apache#16331)

    * fix topn runtime predicate getting value bug for decimal type

    * fix cast_to_string bug for TYPE_DECIMALV2

commit d70cdf61521a23417c9bc734a3cdb668265a15b0
Author: Kang <kxiao.tiger@gmail.com>
Date:   Wed Feb 22 16:18:46 2023 +0800

    topn sync doris order by key topn query optimization apache#15663

commit 1df514c8f0b66ae9a8438617163a31848e519949
Author: Kang <kxiao.tiger@gmail.com>
Date:   Wed Feb 22 15:14:43 2023 +0800

    sync with doris runtime prune for topn query apache#15558

swjtu-zhanglei pushed a commit to swjtu-zhanglei/incubator-doris that referenced this pull request


          [pick](topn) Sync and pick topn from doris master (apache#1431)

99fbafb

Labels

approved area/planner area/vectorization reviewed