-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[Improvement](parquet-reader) Optimize and refactor parquet reader to improve performance. #16818
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
be/src/vec/columns/column_string.cpp
Outdated
| // https://github.com/ClickHouse/ClickHouse/blob/master/src/Columns/IColumn.h | ||
| // and modified by Doris | ||
|
|
||
| #pragma once |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: #pragma once in main file [clang-diagnostic-pragma-once-outside-header]
#pragma once
^
be/src/vec/columns/column_string.cpp
Outdated
| ColumnPtrWrapper(vectorized::ColumnPtr col) : column_ptr(col) {} | ||
| }; | ||
| } // namespace doris | ||
| [chenqi@VM-0-101-centos doris-master]$ cat be/src/vec/columns/column_string.cpp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: use of undeclared identifier 'chenqi' [clang-diagnostic-error]
[chenqi@VM-0-101-centos doris-master]$ cat be/src/vec/columns/column_string.cpp
^
be/src/vec/columns/column_string.cpp
Outdated
| ColumnPtrWrapper(vectorized::ColumnPtr col) : column_ptr(col) {} | ||
| }; | ||
| } // namespace doris | ||
| [chenqi@VM-0-101-centos doris-master]$ cat be/src/vec/columns/column_string.cpp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: '$' in identifier [clang-diagnostic-dollar-in-identifier-extension]
[chenqi@VM-0-101-centos doris-master]$ cat be/src/vec/columns/column_string.cpp
^
be/src/vec/columns/column_string.cpp
Outdated
| ColumnPtrWrapper(vectorized::ColumnPtr col) : column_ptr(col) {} | ||
| }; | ||
| } // namespace doris | ||
| [chenqi@VM-0-101-centos doris-master]$ cat be/src/vec/columns/column_string.cpp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: C++ requires a type specifier for all declarations [clang-diagnostic-error]
[chenqi@VM-0-101-centos doris-master]$ cat be/src/vec/columns/column_string.cpp
^
be/src/vec/columns/column_string.cpp
Outdated
| ColumnPtrWrapper(vectorized::ColumnPtr col) : column_ptr(col) {} | ||
| }; | ||
| } // namespace doris | ||
| [chenqi@VM-0-101-centos doris-master]$ cat be/src/vec/columns/column_string.cpp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: expected ';' after top level declarator [clang-diagnostic-error]
| [chenqi@VM-0-101-centos doris-master]$ cat be/src/vec/columns/column_string.cpp | |
| [chenqi@VM-0-101-centos doris-master]$; cat be/src/vec/columns/column_string.cpp |
be/src/vec/columns/column_string.cpp
Outdated
| const size_t old_size = chars.size(); | ||
| const size_t new_size = old_size + string_size; | ||
| check_chars_length(new_size, offsets.size() + 1); | ||
| check_chars_length(new_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: too few arguments to function call, expected 2, have 1 [clang-diagnostic-error]
check_chars_length(new_size);
^be/src/vec/columns/column_string.h:63: 'check_chars_length' declared here
void ALWAYS_INLINE check_chars_length(size_t total_length, size_t element_number) const {
^
be/src/vec/columns/column_string.cpp
Outdated
| new_chars_size += size_at(indexes[i]); | ||
| } | ||
| check_chars_length(new_chars_size, limit); | ||
| check_chars_length(new_chars_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: too few arguments to function call, expected 2, have 1 [clang-diagnostic-error]
check_chars_length(new_chars_size);
^be/src/vec/columns/column_string.h:63: 'check_chars_length' declared here
void ALWAYS_INLINE check_chars_length(size_t total_length, size_t element_number) const {
^
be/src/vec/columns/column_string.cpp
Outdated
| } | ||
|
|
||
| check_chars_length(res_chars.size(), res_offsets.size()); | ||
| check_chars_length(res_chars.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: too few arguments to function call, expected 2, have 1 [clang-diagnostic-error]
check_chars_length(res_chars.size());
^be/src/vec/columns/column_string.h:63: 'check_chars_length' declared here
void ALWAYS_INLINE check_chars_length(size_t total_length, size_t element_number) const {
^
be/src/vec/columns/column_string.cpp
Outdated
| } | ||
|
|
||
| check_chars_length(res_chars.size(), res_offsets.size()); | ||
| check_chars_length(res_chars.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: too few arguments to function call, expected 2, have 1 [clang-diagnostic-error]
check_chars_length(res_chars.size());
^be/src/vec/columns/column_string.h:63: 'check_chars_length' declared here
void ALWAYS_INLINE check_chars_length(size_t total_length, size_t element_number) const {
^| } | ||
|
|
||
| // For insert_many_strings_overflow | ||
| _dict_data.resize(total_length + MAX_STRINGS_OVERFLOW_SIZE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: use of undeclared identifier 'MAX_STRINGS_OVERFLOW_SIZE' [clang-diagnostic-error]
_dict_data.resize(total_length + MAX_STRINGS_OVERFLOW_SIZE);
^6a9090d to
9097d77
Compare
|
clang-tidy review says "All clean, LGTM! 👍" |
daf7807 to
1595c7d
Compare
|
clang-tidy review says "All clean, LGTM! 👍" |
1 similar comment
|
clang-tidy review says "All clean, LGTM! 👍" |
… improve performance. 1. Improve 2x performance for small dict string by aligned copying. 2. Refactor code to decrease condition(if) checking. 3. Don't call skip(0). 4. Don't read page index if no condition.
1595c7d to
075feef
Compare
|
clang-tidy review says "All clean, LGTM! 👍" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
be/src/vec/columns/column.h
Outdated
| LOG(FATAL) << "Method insert_many_strings_overflow is not supported for " << get_name(); | ||
| } | ||
|
|
||
| virtual void insert_many_strings_overflow(const StringRef* strings, size_t num, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: class member cannot be redeclared [clang-diagnostic-error]
virtual void insert_many_strings_overflow(const StringRef* strings, size_t num,
^be/src/vec/columns/column.h:267: previous definition is here
virtual void insert_many_strings_overflow(const StringRef* strings, size_t num,
^|
|
||
| #define MAX_STRINGS_OVERFLOW_SIZE 128 | ||
| template <typename T, size_t copy_length> | ||
| void insert_many_strings_fixed_length(const StringRef* strings, size_t num) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: class member cannot be redeclared [clang-diagnostic-error]
void insert_many_strings_fixed_length(const StringRef* strings, size_t num)
^be/src/vec/columns/column_string.h:284: previous declaration is here
void insert_many_strings_fixed_length(const StringRef* strings, size_t num)
^| __attribute__((noinline)); | ||
|
|
||
| template <size_t copy_length> | ||
| void insert_many_strings_fixed_length(const StringRef* strings, size_t num) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: class member cannot be redeclared [clang-diagnostic-error]
void insert_many_strings_fixed_length(const StringRef* strings, size_t num) {
^be/src/vec/columns/column_string.h:288: previous definition is here
void insert_many_strings_fixed_length(const StringRef* strings, size_t num) {
^| chars.resize(old_size + new_size); | ||
| } | ||
|
|
||
| void insert_many_strings_overflow(const StringRef* strings, size_t num, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: class member cannot be redeclared [clang-diagnostic-error]
void insert_many_strings_overflow(const StringRef* strings, size_t num,
^be/src/vec/columns/column_string.h:311: previous definition is here
void insert_many_strings_overflow(const StringRef* strings, size_t num,
^| class BaseDictDecoder : public Decoder { | ||
| public: | ||
| BaseDictDecoder() = default; | ||
| virtual ~BaseDictDecoder() override = default; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: 'virtual' is redundant since the function is already declared 'override' [modernize-use-override]
| virtual ~BaseDictDecoder() override = default; | |
| ~BaseDictDecoder() override = default; |
| } | ||
|
|
||
| template <typename CppType, typename ColumnType> | ||
| Status _decode_date(MutableColumnPtr& doris_column, ColumnSelectVector& select_vector) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: unknown type name 'ColumnSelectVector' [clang-diagnostic-error]
Status _decode_date(MutableColumnPtr& doris_column, ColumnSelectVector& select_vector) {
^| } | ||
|
|
||
| template <typename CppType, typename ColumnType> | ||
| Status _decode_datetime64(MutableColumnPtr& doris_column, ColumnSelectVector& select_vector) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: unknown type name 'ColumnSelectVector' [clang-diagnostic-error]
Status _decode_datetime64(MutableColumnPtr& doris_column, ColumnSelectVector& select_vector) {
^| } | ||
|
|
||
| template <typename CppType, typename ColumnType> | ||
| Status _decode_datetime96(MutableColumnPtr& doris_column, ColumnSelectVector& select_vector) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: unknown type name 'ColumnSelectVector' [clang-diagnostic-error]
Status _decode_datetime96(MutableColumnPtr& doris_column, ColumnSelectVector& select_vector) {
^|
|
||
| template <typename DecimalPrimitiveType, typename DecimalPhysicalType> | ||
| Status _decode_primitive_decimal(MutableColumnPtr& doris_column, DataTypePtr& data_type, | ||
| ColumnSelectVector& select_vector) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: unknown type name 'ColumnSelectVector' [clang-diagnostic-error]
ColumnSelectVector& select_vector) {
^| return Status::OK(); | ||
| } | ||
|
|
||
| tparquet::Type::type _physical_type; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: use of undeclared identifier 'tparquet' [clang-diagnostic-error]
tparquet::Type::type _physical_type;
^4097da1 to
61b6fbb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
| .insert_many_dict_data(&dict_items[0], dict_items.size()); | ||
| } | ||
| _indexes.resize(non_null_size); | ||
| _index_batch_decoder->GetBatch(&_indexes[0], non_null_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: use of undeclared identifier '_indexes' [clang-diagnostic-error]
_index_batch_decoder->GetBatch(&_indexes[0], non_null_size);
^| if constexpr (std::is_same_v<T, PHYSICAL_TYPE>) { \ | ||
| return _decode_numeric<CPP_NUMERIC_TYPE>(doris_column, select_vector); \ | ||
| } | ||
| FOR_LOGICAL_NUMERIC_TYPES(DISPATCH) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: use of undeclared identifier 'DISPATCH' [clang-diagnostic-error]
FOR_LOGICAL_NUMERIC_TYPES(DISPATCH)
^| if constexpr (std::is_same_v<T, PHYSICAL_TYPE>) { \ | ||
| return _decode_numeric<CPP_NUMERIC_TYPE>(doris_column, select_vector); \ | ||
| } | ||
| FOR_LOGICAL_NUMERIC_TYPES(DISPATCH) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: expected ';' after expression [clang-diagnostic-error]
| FOR_LOGICAL_NUMERIC_TYPES(DISPATCH) | |
| FOR_LOGICAL_NUMERIC_TYPES(DISPATCH); |
| } | ||
| break; | ||
| case TypeIndex::DateTime: | ||
| if constexpr (std::is_same_v<T, ParquetInt96>) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: use of undeclared identifier 'ParquetInt96' [clang-diagnostic-error]
if constexpr (std::is_same_v<T, ParquetInt96>) {
^61b6fbb to
079f30a
Compare
f5c2798 to
67576d1
Compare
The SQL `SELECT nationkey FROM regression_test_query_p0_limit.tpch_tiny_nation ORDER BY nationkey DESC LIMIT 5` make be core dump since dereference a nullptr `read_orderby_key_columns in VCollectIterator::_topn_next`, triggered by skipping _colname_to_value_range init in #16818 . This PR makes two changes: 1. avoid read_orderby_key_columns nullptr in TabletReader::_init_orderby_keys_param 2. return error if read_orderby_key_columns is nullptr unexpected in VCollectIterator::_topn_next to avoid core dump
|
LGTM |
morningman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| __attribute__((noinline)); | ||
|
|
||
| template <size_t copy_length> | ||
| void insert_many_strings_fixed_length(const StringRef* strings, size_t num) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add some comment about these methods.
eg, when to use it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I will add it in the next PR.
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
The SQL `SELECT nationkey FROM regression_test_query_p0_limit.tpch_tiny_nation ORDER BY nationkey DESC LIMIT 5` make be core dump since dereference a nullptr `read_orderby_key_columns in VCollectIterator::_topn_next`, triggered by skipping _colname_to_value_range init in apache#16818 . This PR makes two changes: 1. avoid read_orderby_key_columns nullptr in TabletReader::_init_orderby_keys_param 2. return error if read_orderby_key_columns is nullptr unexpected in VCollectIterator::_topn_next to avoid core dump
… improve performance. (apache#16818) Optimize and refactor parquet reader to improve performance. - Improve 2x performance for small dict string by aligned copying. - Refactor code to decrease condition(if) checking. - Don't call skip(0). - Don't read page index if no condition. **ssb-flat-100**: (single-machine, single-thread) | Query | before opt | after opt | | ------------- |:-------------:| ---------:| | SELECT count(lo_revenue) FROM lineorder_flat | 9.23 | 9.12 | | SELECT count(lo_linenumber) FROM lineorder_flat | 4.50 | 4.36 | | SELECT count(c_name) FROM lineorder_flat | 18.22 | 17.88| | **SELECT count(lo_shipmode) FROM lineorder_flat** |**10.09** | **6.15**|
Issue Number: close http://jira.selectdb-in.cc/browse/CORE-1462 Describe the overview of changes. commit e1697741a82f875ca42b0d18caa7972eaa225bee Author: Kang <kxiao.tiger@gmail.com> Date: Thu Jan 19 22:59:29 2023 +0800 [opt](test) scalar_types_p0 use 100k lines dataset and scalar_types_p2 use 1000k (apache#16104) commit 33a47e8d02644123ffd8c5c4353653c1c175e96a Author: Kang <kxiao.tiger@gmail.com> Date: Wed Jan 18 14:17:24 2023 +0800 [testcase](bitmap index)bitmap index testcase (apache#15975) * add bitmap index testcases for all scalar types commit 260a631441834ca7e23da4b77c922eb818eddca7 Author: Kang <kxiao.tiger@gmail.com> Date: Mon Jan 16 16:49:59 2023 +0800 [regression-test](topn)add test cases for nonkey topn query for each scalar type (apache#15790) related to apache#15558 apache#15693 1. dup key table with 17 scalar datatypes 2. unique key table with mow enabled 3. unique key table with mow disabled commit 81cea5219ae86df950f10aa123072df78c7cdf23 Author: Kang <kxiao.tiger@gmail.com> Date: Sun Feb 19 23:28:33 2023 +0800 [bugfix](topn) fix topn read_orderby_key_columns nullptr (apache#16896) The SQL `SELECT nationkey FROM regression_test_query_p0_limit.tpch_tiny_nation ORDER BY nationkey DESC LIMIT 5` make be core dump since dereference a nullptr `read_orderby_key_columns in VCollectIterator::_topn_next`, triggered by skipping _colname_to_value_range init in apache#16818 . This PR makes two changes: 1. avoid read_orderby_key_columns nullptr in TabletReader::_init_orderby_keys_param 2. return error if read_orderby_key_columns is nullptr unexpected in VCollectIterator::_topn_next to avoid core dump commit 2fee1d1d79942e49eddaafdc2b49e49b0651b109 Author: Kang <kxiao.tiger@gmail.com> Date: Fri Feb 10 12:56:33 2023 +0800 [Improvement](topn) add limit threashold session variable and fuzzy for topn optimizations (apache#16514) 1. add limit threshold for topn runtime pushdown and key topn optimization 2. use unified session variable topn_opt_limit_threshold for all topn optimizations 3. add fuzzy support for topn_opt_limit_threshold commit 1696bed39129fcc891f32f64ff1fb43f9531fcd4 Author: Kang <kxiao.tiger@gmail.com> Date: Thu Feb 2 09:13:32 2023 +0800 [bugfix](topn) fix topn runtime predicate getting value bug for decimal type (apache#16331) * fix topn runtime predicate getting value bug for decimal type * fix cast_to_string bug for TYPE_DECIMALV2 commit d70cdf61521a23417c9bc734a3cdb668265a15b0 Author: Kang <kxiao.tiger@gmail.com> Date: Wed Feb 22 16:18:46 2023 +0800 topn sync doris order by key topn query optimization apache#15663 commit 1df514c8f0b66ae9a8438617163a31848e519949 Author: Kang <kxiao.tiger@gmail.com> Date: Wed Feb 22 15:14:43 2023 +0800 sync with doris runtime prune for topn query apache#15558
Issue Number: close http://jira.selectdb-in.cc/browse/CORE-1462 Describe the overview of changes. commit e1697741a82f875ca42b0d18caa7972eaa225bee Author: Kang <kxiao.tiger@gmail.com> Date: Thu Jan 19 22:59:29 2023 +0800 [opt](test) scalar_types_p0 use 100k lines dataset and scalar_types_p2 use 1000k (apache#16104) commit 33a47e8d02644123ffd8c5c4353653c1c175e96a Author: Kang <kxiao.tiger@gmail.com> Date: Wed Jan 18 14:17:24 2023 +0800 [testcase](bitmap index)bitmap index testcase (apache#15975) * add bitmap index testcases for all scalar types commit 260a631441834ca7e23da4b77c922eb818eddca7 Author: Kang <kxiao.tiger@gmail.com> Date: Mon Jan 16 16:49:59 2023 +0800 [regression-test](topn)add test cases for nonkey topn query for each scalar type (apache#15790) related to apache#15558 apache#15693 1. dup key table with 17 scalar datatypes 2. unique key table with mow enabled 3. unique key table with mow disabled commit 81cea5219ae86df950f10aa123072df78c7cdf23 Author: Kang <kxiao.tiger@gmail.com> Date: Sun Feb 19 23:28:33 2023 +0800 [bugfix](topn) fix topn read_orderby_key_columns nullptr (apache#16896) The SQL `SELECT nationkey FROM regression_test_query_p0_limit.tpch_tiny_nation ORDER BY nationkey DESC LIMIT 5` make be core dump since dereference a nullptr `read_orderby_key_columns in VCollectIterator::_topn_next`, triggered by skipping _colname_to_value_range init in apache#16818 . This PR makes two changes: 1. avoid read_orderby_key_columns nullptr in TabletReader::_init_orderby_keys_param 2. return error if read_orderby_key_columns is nullptr unexpected in VCollectIterator::_topn_next to avoid core dump commit 2fee1d1d79942e49eddaafdc2b49e49b0651b109 Author: Kang <kxiao.tiger@gmail.com> Date: Fri Feb 10 12:56:33 2023 +0800 [Improvement](topn) add limit threashold session variable and fuzzy for topn optimizations (apache#16514) 1. add limit threshold for topn runtime pushdown and key topn optimization 2. use unified session variable topn_opt_limit_threshold for all topn optimizations 3. add fuzzy support for topn_opt_limit_threshold commit 1696bed39129fcc891f32f64ff1fb43f9531fcd4 Author: Kang <kxiao.tiger@gmail.com> Date: Thu Feb 2 09:13:32 2023 +0800 [bugfix](topn) fix topn runtime predicate getting value bug for decimal type (apache#16331) * fix topn runtime predicate getting value bug for decimal type * fix cast_to_string bug for TYPE_DECIMALV2 commit d70cdf61521a23417c9bc734a3cdb668265a15b0 Author: Kang <kxiao.tiger@gmail.com> Date: Wed Feb 22 16:18:46 2023 +0800 topn sync doris order by key topn query optimization apache#15663 commit 1df514c8f0b66ae9a8438617163a31848e519949 Author: Kang <kxiao.tiger@gmail.com> Date: Wed Feb 22 15:14:43 2023 +0800 sync with doris runtime prune for topn query apache#15558
Proposed changes
Optimize and refactor parquet reader to improve performance.
ssb-flat-100: (single-machine, single-thread)
Problem summary
Performance Issue.
Checklist(Required)
Further comments
If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...