
WIP some perf optimizations #81944

Draft
amosbird wants to merge 13 commits into ClickHouse:master from amosbird:bench-opt

Conversation

@amosbird
Collaborator

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

WIP

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@clickhouse-gh
Contributor

clickhouse-gh bot commented Jun 16, 2025

Workflow [PR], commit [0b43aeb]

@clickhouse-gh clickhouse-gh bot added the pr-performance Pull request with some performance improvements label Jun 16, 2025
@clickhouse-gh
Contributor

clickhouse-gh bot commented Jun 16, 2025

Workflow [PR], commit [e3a48c0]

Summary:
15 failures out of 80 shown:

job_name test_name status info comment
Build (amd_darwin) failure
Build ClickHouse failure
Build (arm_darwin) failure
Build ClickHouse failure
Unit tests (tsan) failure
Stateless tests (amd_asan, distributed plan, parallel, 1/2) failure
02872_null_as_default_nested FAIL
Exception in test runner FAIL
Killed by signal (in clickhouse-server.log or clickhouse-server.err.log) FAIL
Fatal messages (in clickhouse-server.log or clickhouse-server.err.log) FAIL
Stateless tests (amd_asan, distributed plan, sequential) dropped
Stateless tests (amd_binary, old analyzer, s3 storage, DatabaseReplicated, parallel) dropped
Stateless tests (amd_binary, old analyzer, s3 storage, DatabaseReplicated, sequential) dropped
Stateless tests (amd_binary, ParallelReplicas, s3 storage, parallel) dropped
Stateless tests (amd_binary, ParallelReplicas, s3 storage, sequential) dropped
Stateless tests (amd_debug, AsyncInsert, s3 storage, parallel) dropped
Stateless tests (amd_debug, AsyncInsert, s3 storage, sequential) dropped
Stateless tests (amd_debug, sequential) dropped
Stateless tests (amd_tsan, parallel, 1/2) dropped
Stateless tests (amd_tsan, parallel, 2/2) dropped
Stateless tests (amd_tsan, sequential, 1/2) dropped

@amosbird amosbird force-pushed the bench-opt branch 3 times, most recently from 7331186 to 362f074 Compare June 17, 2025 09:01
@novikd novikd self-assigned this Jun 17, 2025
@amosbird amosbird force-pushed the bench-opt branch 9 times, most recently from 99c9d75 to 8b84357 Compare June 22, 2025 19:49
@clickhouse-gh clickhouse-gh bot added the submodule changed At least one submodule changed in this PR. label Jun 22, 2025
@amosbird amosbird force-pushed the bench-opt branch 3 times, most recently from f59f604 to cda07f8 Compare June 24, 2025 18:45
@clickhouse-gh
Contributor

clickhouse-gh bot commented Aug 26, 2025

Dear @amosbird, this PR hasn't been updated for a while. Will you continue working on it? If not, please close it. Otherwise, ignore this message.

@antonio2368
Member

@amosbird would it be possible to break this PR into a few smaller ones?
If I understood everything correctly, this PR consists of a couple of different optimizations.
I think that would make it easier to merge everything incrementally: smaller PRs are easier to review, and the optimizations would not block each other from merging.
WDYT?

@amosbird
Collaborator Author

amosbird commented Sep 8, 2025

@antonio2368 This PR has already been split into several PRs (#82478 #81992 #82104 #82380 #82850 and more to come). This one is specifically for producing the official binary to run ClickBench.

This commit implements pushdown of the TopN threshold state into MergeTreeSource to optimize filtering during the read phase. By propagating the threshold, i.e. the key of the current (N-1)th element of the TopN state, filtering can happen early in the read phase, improving performance by skipping rows that can no longer enter the TopN.
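The mechanism described above can be sketched in Python (all names here are hypothetical illustrations, not the actual MergeTreeSource code): a running threshold derived from the current worst TopN member lets the reader discard rows before they ever reach the sort.

```python
import heapq

def topn_with_pushdown(blocks, n, key):
    """Keep the n rows with the smallest keys, filtering each incoming
    block against the running threshold (the key of the current worst
    TopN member) before rows reach the sorting step. A sketch of pushing
    the TopN threshold into the read phase; not ClickHouse's actual code."""
    heap = []        # max-heap via negated keys, holds the current TopN
    rows_kept = 0    # rows that survived the read-phase filter
    for block in blocks:
        threshold = -heap[0][0] if len(heap) == n else None
        for row in block:
            k = key(row)
            if threshold is not None and k >= threshold:
                continue  # skipped already during the read phase
            rows_kept += 1
            if len(heap) < n:
                heapq.heappush(heap, (-k, row))
            else:
                heapq.heappushpop(heap, (-k, row))
            if len(heap) == n:
                threshold = -heap[0][0]  # threshold tightens as better rows arrive
    return sorted((r for _, r in heap), key=key), rows_kept
```

On partially sorted data the threshold converges quickly, so most subsequent blocks are filtered almost entirely, which is the intuition behind the Q23 speedup.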
amosbird and others added 7 commits September 8, 2025 23:07
… and reusable when the part set is unchanged
Add support for StringWithSizeStream, a new string type that stores string lengths in a separate stream. This enables the .size subcolumn and may improve performance in certain workloads.

Enable .size subcolumn access for both String and StringWithSizeStream types, allowing mixed queries over both formats.

Introduce a new setting optimize_empty_string_comparisons to rewrite expressions like str = '' into isEmpty(str) or isNotEmpty(str). Extend FunctionToSubcolumnsPass to support String columns, enabling length(str) to be rewritten as str.size.
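The two rewrites just described can be illustrated on a toy expression tree (the tuple representation below is invented for illustration; the real work happens in FunctionToSubcolumnsPass and the new query-tree pass):

```python
def rewrite(node):
    """Toy version of the rewrites described above:
      col = ''     -> isEmpty(col)     (col != '' -> isNotEmpty(col))
      length(col)  -> read of the col.size subcolumn
    Nodes are tuples: ('eq'|'ne', column, literal) or ('func', name, column).
    Hypothetical AST shape, for illustration only."""
    kind = node[0]
    if kind in ('eq', 'ne') and node[2] == '':
        fname = 'isEmpty' if kind == 'eq' else 'isNotEmpty'
        return ('func', fname, node[1])           # avoid materializing the string
    if kind == 'func' and node[1] == 'length':
        return ('subcolumn', node[2], 'size')     # answered from the size stream
    return node                                   # everything else is untouched
```

Both rewrites matter for the same reason: the rewritten form can be answered from the sizes alone, without decompressing string contents.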

Add MergeTreeSetting `serialize_string_with_size_stream` to control whether `String` columns are serialized with a separate size stream. This setting is enabled by default to test the new string serialization, and will be disabled by default when landing due to compatibility concerns.

In benchmarks, the new layout does not provide general performance improvements and may even cause slight regressions. This is likely because the original single-stream String format is already highly optimized, and in most cases, decompression is the primary bottleneck. Separating the data into two streams (sizes and content) may introduce additional decompression overhead compared to a single contiguous stream.

Nevertheless, when only the .size subcolumn is queried, especially for large strings, StringWithSizeStream provides significant performance benefits, as demonstrated in ClickBench Q27.
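A minimal model of the two-stream layout shows why .size queries get cheap (this is an in-memory toy, not ClickHouse's on-disk format):

```python
class StringWithSizeStream:
    """Toy two-stream layout: lengths in one list (the 'size stream'),
    concatenated bytes in another (the 'data stream'). Reading only the
    .size subcolumn never touches the data stream -- the win seen in
    ClickBench Q27. Illustrative sketch, not the real serialization."""
    def __init__(self, strings):
        self.sizes = [len(s) for s in strings]   # separate size stream
        self.data = ''.join(strings)             # contiguous content stream

    def size_subcolumn(self):
        return self.sizes                        # answered from sizes alone

    def read_row(self, i):
        start = sum(self.sizes[:i])              # offset = prefix sum of sizes
        return self.data[start:start + self.sizes[i]]
```

Conversely, a full-row read now needs both streams, which is consistent with the slight regressions observed when the whole string is required.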
@rschu1ze
Member

ClickBench query Q4 SELECT COUNT(DISTINCT UserID) FROM hits; got faster by >4x (dashboard). I read through the list of included changes and I see some modifications in hash tables in the code. But it is not really clear which exact change caused the speedup.

Likewise,

  • Q11 `SELECT MobilePhone, MobilePhoneModel, COUNT(DISTINCT UserID) AS u FROM hits WHERE MobilePhoneModel <> '' GROUP BY MobilePhone, MobilePhoneModel ORDER BY u DESC LIMIT 10;`
  • Q23 `SELECT * FROM hits WHERE URL LIKE '%google%' ORDER BY EventTime LIMIT 10;`
  • Q27 `SELECT CounterID, AVG(length(URL)) AS l, COUNT(*) AS c FROM hits WHERE URL <> '' GROUP BY CounterID HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;`
  • Q30 `SELECT SearchEngineID, ClientIP, COUNT(*) AS c, SUM(IsRefresh), AVG(ResolutionWidth) FROM hits WHERE SearchPhrase <> '' GROUP BY SearchEngineID, ClientIP ORDER BY c DESC LIMIT 10;`

got impressive speedups.

@amosbird Do you have an idea what changes speed these up? Asking so we can prioritize the corresponding PRs.

@amosbird
Collaborator Author

Do you have an idea what changes speed these up?

Yes, I do.

  • Q4 is faster because count_distinct_optimization is enabled by default in this PR. This is the easiest optimization to achieve, and since similar techniques are used in DuckDB and Apache Doris, it makes sense to enable it by default.

  • Q23 is sped up by dynamic pushdown of the TopN predicate. This part isn’t yet separated in the PR. It can be prioritized, but I’m less focused on it than the next improvement.

  • All other aggregation queries benefit from a new hash table implementation. I plan to use this to replace my old string hash table implementation. This part isn’t separated in the PR yet either, but I’ll likely work on it this week (I have several pending PRs at the moment...).
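For the first bullet, the effect of count_distinct_optimization can be sketched as a two-level aggregation: an inner GROUP BY over (group, key) followed by a plain count per group, which is equivalent to COUNT(DISTINCT key) ... GROUP BY group. The Python below is a hedged model of that rewrite, not ClickHouse's implementation:

```python
from collections import defaultdict

def count_distinct_via_group_by(rows, key, group):
    """Model of the count_distinct_optimization rewrite: instead of
    keeping a distinct-set per group, first deduplicate via a
    first-level GROUP BY (group, key), then count rows per group in a
    second aggregation pass."""
    inner = set()
    for row in rows:
        inner.add((group(row), key(row)))   # first level: GROUP BY group, key
    outer = defaultdict(int)
    for g, _ in inner:
        outer[g] += 1                       # second level: COUNT() per group
    return dict(outer)
```

The payoff is that the second pass aggregates already-deduplicated keys, which vectorizes and parallelizes better than per-group distinct sets.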

@EmeraldShift
Contributor

As a personal exercise in writing and building ClickHouse code, I cherry-picked Push TopN threshold to MergeTreeSource
and built it -- it's quite fast! I made an example table:

CREATE TABLE test
ENGINE = MergeTree
ORDER BY (id, time)
AS SELECT
    CAST(number, 'UInt8') AS id,
    CAST(number, 'UInt64') AS time
FROM numbers(10000)

and the simple query SELECT * FROM test ORDER BY time LIMIT 10 goes from ~2s to 0.8s with the optimization! However, when I add SETTINGS max_threads=1, the speedup disappears, and doesn't come back even if I add a min-max index on time, so I have a few curiosities:

  1. What exactly is the change that the TopN optimization is making to the pipeline? From the code it looks like it's adding an initial prewhere node that filters based on a global threshold state.
  2. Why does the speedup disappear with one thread? Since the table is partially ordered by time, I would expect the global threshold to very quickly reach 9 and thus filter out all greater values (most of the table). Is my mental model wrong?
  3. Now that skip indexes can be used on data reading, will it be possible to apply skip indexes based on the global threshold? Is that compatible with the current shape of the change?
  4. I initially wanted to extract the TopN optimization into its own PR, both because I am interested in the feature, and to learn how to contribute to ClickHouse. Does this sound fine to you? Maybe it's too difficult a first change to try and introduce / rebase onto current master? What are your thoughts here?


String getName() const override { return name; }

bool useDefaultImplementationForNulls() const override { return true; }

@gaoyuanning commented Dec 4, 2025


If useDefaultImplementationForNulls() returns true, nullable argument columns are passed in as non-nullable, while current_ptr->columns is nullable, so there will be a type mismatch. The same concern applies to useDefaultImplementationForLowCardinalityColumns().

@clickhouse-gh
Contributor

clickhouse-gh bot commented Feb 3, 2026

Dear @amosbird, this PR hasn't been updated for a while. Will you continue working on it? If not, please close it. Otherwise, ignore this message.


7 participants