Dear @amosbird, this PR hasn't been updated for a while. Will you continue working on it? If not, please close it. Otherwise, ignore this message.
@amosbird would it be possible to break this PR into a few smaller ones?
This commit implements the pushdown of the TopN threshold state into MergeTreeSource to optimize filtering during the read phase. By transferring the threshold representing the (N-1)th element of the TopN state, filtering can occur early in the read phase, improving performance by skipping rows that fall below the threshold.
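To illustrate the idea in the commit message above, here is a minimal, self-contained sketch of threshold-based early filtering. It is not the code from this PR: `TopNState` and `readWithThreshold` are hypothetical names, the rows are plain integers, and the query is assumed to be `ORDER BY value ASC LIMIT N`, so a row is skipped when it is not smaller than the worst value currently kept. For descending order the comparison would flip.

```cpp
#include <cstddef>
#include <optional>
#include <queue>
#include <vector>

// Hypothetical stand-in for the TopN state: keeps the N smallest values seen
// so far and exposes the worst kept value as a threshold.
class TopNState
{
public:
    explicit TopNState(size_t limit_) : limit(limit_) {}

    // Empty until N values have been collected. After that, a value that is
    // not strictly smaller than the threshold cannot enter the result.
    std::optional<int> threshold() const
    {
        if (limit == 0 || heap.size() < limit)
            return std::nullopt;
        return heap.top();
    }

    void insert(int value)
    {
        if (limit == 0)
            return;
        if (heap.size() < limit)
        {
            heap.push(value);
        }
        else if (value < heap.top())
        {
            heap.pop();
            heap.push(value);
        }
    }

    // Returns the kept values in ascending order.
    std::vector<int> sortedResult() const
    {
        auto copy = heap;
        std::vector<int> result(copy.size());
        for (size_t i = copy.size(); i > 0; --i)
        {
            result[i - 1] = copy.top();
            copy.pop();
        }
        return result;
    }

private:
    size_t limit;
    std::priority_queue<int> heap; // max-heap: top() is the largest kept value
};

// Hypothetical read phase: values that cannot beat the current threshold are
// dropped before they reach the sorting step. Returns how many were skipped.
size_t readWithThreshold(const std::vector<int> & rows, TopNState & state)
{
    size_t skipped = 0;
    for (int value : rows)
    {
        auto threshold = state.threshold();
        if (threshold && value >= *threshold)
        {
            ++skipped;
            continue;
        }
        state.insert(value);
    }
    return skipped;
}
```

In this toy model the threshold only tightens as more rows are read, so the later a row arrives, the more likely it is to be skipped without being materialized for sorting.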
… and reusable when the part set is unchanged
… default properly
Add support for StringWithSizeStream, a new string type that stores string lengths in a separate stream. This enables the .size subcolumn and may improve performance in certain workloads.

Enable .size subcolumn access for both String and StringWithSizeStream types, allowing mixed queries over both formats.

Introduce a new setting optimize_empty_string_comparisons to rewrite expressions like str = '' into isEmpty(str) or isNotEmpty(str).

Extend FunctionToSubcolumnsPass to support String columns, enabling length(str) to be rewritten as str.size.

Add MergeTreeSetting `serialize_string_with_size_stream` to control whether `String` columns are serialized with a separate size stream. This setting is enabled by default to test the new string serialization, and will be disabled by default when landing due to compatibility concerns.

In benchmarks, the new layout does not provide general performance improvements and may even cause slight regressions. This is likely because the original single-stream String format is already highly optimized, and in most cases, decompression is the primary bottleneck. Separating the data into two streams (sizes and content) may introduce additional decompression overhead compared to a single contiguous stream.

Nevertheless, when only the .size subcolumn is queried, especially for large strings, StringWithSizeStream provides significant performance benefits, as demonstrated in ClickBench Q27.
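The trade-off described in this commit message can be pictured with a small in-memory model of a two-stream layout. This is only a sketch under stated assumptions, not the serialization implemented in the PR: `TwoStreamStrings`, `serialize`, `totalLength`, `countEmpty`, and `deserialize` are hypothetical names, and answering an emptiness check from the sizes alone is an assumption about how such a layout could be used.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical in-memory model of a two-stream layout: one stream holds the
// lengths, the other holds the concatenated bytes. Not the on-disk format.
struct TwoStreamStrings
{
    std::vector<uint64_t> sizes;
    std::string content;
};

TwoStreamStrings serialize(const std::vector<std::string> & values)
{
    TwoStreamStrings out;
    out.sizes.reserve(values.size());
    for (const auto & value : values)
    {
        out.sizes.push_back(value.size());
        out.content += value;
    }
    return out;
}

// Models a query such as sum(length(str)) rewritten to use str.size:
// only the sizes stream is touched, the content stream is never read.
uint64_t totalLength(const TwoStreamStrings & column)
{
    uint64_t total = 0;
    for (uint64_t size : column.sizes)
        total += size;
    return total;
}

// Models an emptiness check (assumption: it can be answered from sizes alone).
size_t countEmpty(const TwoStreamStrings & column)
{
    size_t count = 0;
    for (uint64_t size : column.sizes)
        count += (size == 0) ? 1 : 0;
    return count;
}

// A full read needs both streams to rebuild the original values, which is
// where the extra per-stream overhead mentioned above would show up.
std::vector<std::string> deserialize(const TwoStreamStrings & column)
{
    std::vector<std::string> values;
    values.reserve(column.sizes.size());
    size_t offset = 0;
    for (uint64_t size : column.sizes)
    {
        values.push_back(column.content.substr(offset, size));
        offset += size;
    }
    return values;
}
```

In this model, `totalLength` and `countEmpty` never look at `content`, which mirrors why size-only queries over large strings can benefit, while `deserialize` has to walk both streams.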
ClickBench query Q4 Likewise,
got impressive speedups. @amosbird Do you have an idea what changes speed these up? Asking so we can prioritize the corresponding PRs. |
Yes, I do.
As a personal exercise in writing and building ClickHouse code, I cherry-picked "Push TopN threshold to MergeTreeSource" and the simple query
String getName() const override { return name; }
bool useDefaultImplementationForNulls() const override { return true; }
If useDefaultImplementationForNulls() returns true, the argument columns of Nullable type will be passed as non-nullable, while current_ptr->columns is nullable, so there will be a type mismatch. The same applies to useDefaultImplementationForLowCardinalityColumns().
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
WIP
Documentation entry for user-facing changes