
Minimize memory copy in port headers during pipeline construction #70105

Closed

heymind wants to merge 2 commits into ClickHouse:master from heymind:fix/reduce_port_header_memcpy

Conversation

@heymind
Contributor

@heymind heymind commented Sep 29, 2024

This PR introduces an optimization that reduces memory copying of Port headers during construction of the pipeline graph. The core change is switching the header type in Port.h from an owning type to a const shared_ptr<const Block>, which significantly reduces memory copying when building expressions on tables with many wide columns.

  1. The member header in Port.h is now a const shared_ptr<const Block>, which allows shared ownership and immutability. Profiling showed that Port instances were frequently cloned; sharing the header avoids those copies and improves performance.

  2. IProcessor instances contain both input_ports and output_ports. When chaining processors using AddSimpleTransform, the output ports of the previous processor share the same header as the input ports of the next processor. This PR ensures that these headers can share the same reference, reducing the number of unnecessary cloning operations by N * num_streams.

  3. The use of const shared_ptr<const Block> ensures stricter immutability compared to the previous implementation, providing additional safety guarantees enforced by the compiler.

  4. Transitioning all occurrences to the new port header style is a substantial task. To facilitate this, the PR introduces AddSimpleTransform method overloads in Pipe and QueryPipelineBuilder, enabling incremental adoption. An example can be seen in ExpressionStep::transformPipeline, which reduces header copying by a factor of num_streams.

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Minimize memory copy in port headers during pipeline construction

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

CI Settings (Only check the boxes if you know what you are doing):

  • Allow: All Required Checks
  • Allow: Stateless tests
  • Allow: Stateful tests
  • Allow: Integration Tests
  • Allow: Performance tests
  • Allow: All Builds
  • Allow: batch 1, 2 for multi-batch jobs
  • Allow: batch 3, 4, 5, 6 for multi-batch jobs

  • Exclude: Style check
  • Exclude: Fast test
  • Exclude: All with ASAN
  • Exclude: All with TSAN, MSAN, UBSAN, Coverage
  • Exclude: All with aarch64, release, debug

  • Run only fuzzers related jobs (libFuzzer fuzzers, AST fuzzers, etc.)
  • Exclude: AST fuzzers

  • Do not test
  • Woolen Wolfdog
  • Upload binaries for special builds
  • Disable merge-commit
  • Disable CI cache

@CLAassistant

CLAassistant commented Sep 29, 2024

CLA assistant check
All committers have signed the CLA.

@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label Sep 29, 2024
@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added the pr-performance Pull request with some performance improvements label Sep 29, 2024
@robot-clickhouse-ci-1
Contributor

robot-clickhouse-ci-1 commented Sep 29, 2024

This is an automated comment for commit a82b9fd with a description of existing statuses. It is updated for the latest CI run.


Successful checks
Check name | Description | Status
AST fuzzer | Runs randomly generated queries to catch program errors. The build type is optionally given in parentheses. If it fails, ask a maintainer for help | ✅ success
Builds | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success
ClickBench | Runs [ClickBench](https://github.com/ClickHouse/ClickBench/) with an instant-attach table | ✅ success
Compatibility check | Checks that the clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help | ✅ success
Docker keeper image | Builds and optionally pushes the mentioned image to Docker Hub | ✅ success
Docker server image | Builds and optionally pushes the mentioned image to Docker Hub | ✅ success
Docs check | Builds and tests the documentation | ✅ success
Fast test | Normally the first check run for a PR. It builds ClickHouse and runs most of the stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here | ✅ success
Flaky tests | Checks whether newly added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer and additional randomization of thread scheduling. Integration tests are run up to 10 times. If a new test fails at least once, or runs too long, this check is red. We don't allow flaky tests, read the doc | ✅ success
Install packages | Checks that the built packages are installable in a clean environment | ✅ success
Integration tests | The integration tests report. The package type is given in parentheses, and the optional part/total tests in square brackets | ✅ success
Performance Comparison | Measures changes in query performance. The performance test report is described in detail here. The optional part/total tests are in square brackets | ✅ success
Stateful tests | Runs stateful functional tests for ClickHouse binaries built in various configurations: release, debug, with sanitizers, etc. | ✅ success
Stateless tests | Runs stateless functional tests for ClickHouse binaries built in various configurations: release, debug, with sanitizers, etc. | ✅ success
Stress test | Runs stateless functional tests concurrently from several clients to detect concurrency-related errors | ✅ success
Style check | Runs a set of checks to keep the code style clean. If some tests fail, see the related log in the report | ✅ success
Unit tests | Runs the unit tests for different release types | ✅ success
Upgrade check | Runs stress tests on the server version from the last release, then tries to upgrade it to the version from the PR. It checks whether the new server can start up without errors, crashes, or sanitizer asserts | ✅ success

@antaljanosbenjamin antaljanosbenjamin self-assigned this Sep 30, 2024
Member

@antaljanosbenjamin antaljanosbenjamin left a comment


Really nice! I have a few small comments, nothing substantial.


@antaljanosbenjamin
Member

A very similar improvement is visible in another PR, so I am not sure whether this PR actually helps. Let me figure this out.

@heymind heymind force-pushed the fix/reduce_port_header_memcpy branch 2 times, most recently from af4a227 to c88d50a on October 8, 2024 07:42
Co-authored-by: János Benjamin Antal <antaljanosbenjamin@users.noreply.github.com>
@heymind heymind force-pushed the fix/reduce_port_header_memcpy branch from c88d50a to a82b9fd on October 8, 2024 08:02
Member

@antaljanosbenjamin antaljanosbenjamin left a comment


The changes look good; however, I don't see a meaningful speedup in the perf tests. Without an improvement, I am really not sure whether we should merge this. Do you have a query that can show the performance difference?

@heymind
Contributor Author

heymind commented Oct 10, 2024

The changes look good; however, I don't see a meaningful speedup in the perf tests. Without an improvement, I am really not sure whether we should merge this. Do you have a query that can show the performance difference?

This performance issue was identified through profiling, where I noticed numerous memcpy operations in the Port constructor. I will devise a simple SQL query to demonstrate the difference.

@heymind
Contributor Author

heymind commented Oct 10, 2024

I can't share the real query, but here is a similar one that shows the effect.

-- First, create the table
CREATE TABLE my_table
(
    `id` String,
    `long_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_long_column_name_1` Nullable(String),
    `long_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_long_column_name_2` Nullable(String),
    `long_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_long_column_name_3` Nullable(String),
    `long_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_long_column_name_4` Nullable(String),
    `long_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_long_column_name_5` Nullable(String),
    `long_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_long_column_name_6` Nullable(String),
    `long_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_long_column_name_7` Nullable(String),
    `long_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_long_column_name_8` Nullable(String),
    `long_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_long_column_name_9` Nullable(String),
    `long_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_long_column_name_10` Nullable(String),
    `long_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_long_column_name_11` Nullable(String),
    `long_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_long_column_name_12` Nullable(String),
    `long_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_long_column_name_13` Nullable(String),
    `long_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_long_column_name_14` Nullable(String),
    `long_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_long_column_name_15` Nullable(String)
)
ENGINE = MergeTree
ORDER BY id;

-- Generate some random data
INSERT INTO my_table SELECT
    toString(number),
    toString(number),
    toString(number),
    toString(number),
    toString(number),
    toString(number),
    toString(number),
    toString(number),
    toString(number),
    toString(number),
    toString(number),
    toString(number),
    toString(number),
    toString(number),
    toString(number),
    toString(number)
FROM numbers(10000000);

-- The query
SELECT 
    *
FROM 
    my_table AS t1 
JOIN 
    my_table AS t2 ON t1.id = t2.id
JOIN 
    my_table AS t3 ON t2.id = t3.id
JOIN 
    my_table AS t4 ON t3.id = t4.id
JOIN 
    my_table AS t5 ON t4.id = t5.id
JOIN 
    my_table AS t6 ON t5.id = t6.id
JOIN 
    my_table AS t7 ON t6.id = t7.id
JOIN 
    my_table AS t8 ON t7.id = t8.id
JOIN 
    my_table AS t9 ON t8.id = t9.id
JOIN 
    my_table AS t10 ON t9.id = t10.id
JOIN 
    my_table AS t11 ON t10.id = t11.id
JOIN 
    my_table AS t12 ON t11.id = t12.id
JOIN 
    my_table AS t13 ON t12.id = t13.id
JOIN 
    my_table AS t14 ON t13.id = t14.id
JOIN 
    my_table AS t15 ON t14.id = t15.id
JOIN 
    my_table AS t16 ON t15.id = t16.id
JOIN 
    my_table AS t17 ON t16.id = t17.id
JOIN 
    my_table AS t18 ON t17.id = t18.id
JOIN 
    my_table AS t19 ON t18.id = t19.id
JOIN 
    my_table AS t20 ON t19.id = t20.id
JOIN 
    my_table AS t21 ON t20.id = t21.id
JOIN 
    my_table AS t22 ON t21.id = t22.id
JOIN 
    my_table AS t23 ON t22.id = t23.id
JOIN 
    my_table AS t24 ON t23.id = t24.id
JOIN 
    my_table AS t25 ON t24.id = t25.id
JOIN 
    my_table AS t26 ON t25.id = t26.id
JOIN 
    my_table AS t27 ON t26.id = t27.id
JOIN 
    my_table AS t28 ON t27.id = t28.id
JOIN 
    my_table AS t29 ON t28.id = t29.id
JOIN 
    my_table AS t30 ON t29.id = t30.id
JOIN 
    my_table AS t31 ON t30.id = t31.id
JOIN 
    my_table AS t32 ON t31.id = t32.id
JOIN 
    my_table AS t33 ON t32.id = t33.id
JOIN 
    my_table AS t34 ON t33.id = t34.id
JOIN 
    my_table AS t35 ON t34.id = t35.id
JOIN 
    my_table AS t36 ON t35.id = t36.id
JOIN 
    my_table AS t37 ON t36.id = t37.id
JOIN 
    my_table AS t38 ON t37.id = t38.id
JOIN 
    my_table AS t39 ON t38.id = t39.id
JOIN 
    my_table AS t40 ON t39.id = t40.id
JOIN 
    my_table AS t41 ON t40.id = t41.id
JOIN 
    my_table AS t42 ON t41.id = t42.id
JOIN 
    my_table AS t43 ON t42.id = t43.id
JOIN 
    my_table AS t44 ON t43.id = t44.id
JOIN 
    my_table AS t45 ON t44.id = t45.id
JOIN 
    my_table AS t46 ON t45.id = t46.id
JOIN 
    my_table AS t47 ON t46.id = t47.id
JOIN 
    my_table AS t48 ON t47.id = t48.id
JOIN 
    my_table AS t49 ON t48.id = t49.id
JOIN 
    my_table AS t50 ON t49.id = t50.id
SETTINGS max_threads = 128;

-- The result set is excessively large, and this optimization has no noticeable impact on execution.
-- Use explain pipeline to see the differences.

Before:

explain pipeline SELECT * ...
...
543 rows in set. Elapsed: 12.498 sec.

After:

explain pipeline SELECT * ...
...
543 rows in set. Elapsed: 7.708 sec.

@antaljanosbenjamin
Member

The result set is excessively large, and this optimization has no noticeable impact on execution.

Then why should we do this? Just to make EXPLAIN queries faster in some edge cases?

@devcrafter
Member

devcrafter commented Oct 14, 2024

I don't see any such slowdown on a cloud instance (32 GB node, 8 vCPUs) 🤔

EXPLAIN PIPELINE
SELECT *
FROM my_table AS t1
INNER JOIN my_table AS t2 ON t1.id = t2.id
...
INNER JOIN my_table AS t50 ON t49.id = t50.id

Query id: f0c27860-ebae-4517-b191-3d56c5ed4ec1


   ┌─explain────────────────────────────┐
1. │ (Expression)                       │
2. │ ExpressionTransform × 3            │
3. │   (ReadFromRemoteParallelReplicas) │
   └────────────────────────────────────┘

3 rows in set. Elapsed: 0.436 sec.

Update: the same holds without parallel replicas:

EXPLAIN PIPELINE
SELECT *
FROM my_table AS t1
INNER JOIN my_table AS t2 ON t1.id = t2.id
...
INNER JOIN my_table AS t50 ON t49.id = t50.id
SETTINGS enable_parallel_replicas = 0
FORMAT `Null`

Query id: 7fa87904-d3f7-4785-8e2e-47aa6b0067bd

Ok.

0 rows in set. Elapsed: 0.958 sec.

@devcrafter
Member

The slowdown depends on the max_threads setting. In the example, a higher value makes the pipeline wider (more ExpressionTransform instances are created):

EXPLAIN PIPELINE
SELECT *
FROM my_table AS t1
INNER JOIN my_table AS t2 ON t1.id = t2.id
...
INNER JOIN my_table AS t50 ON t49.id = t50.id
SETTINGS enable_parallel_replicas = 0, max_threads = 128
FORMAT `Null`

Query id: 889fb2a7-cb6c-48b0-9ee5-adb5d395c532

Ok.

0 rows in set. Elapsed: 7.048 sec.

@heymind
Contributor Author

heymind commented Oct 16, 2024

The result set is excessively large, and this optimization has no noticeable impact on execution.

Then why should we do this? Just to make EXPLAIN queries faster in some edge cases?

Such queries are also faster when not using EXPLAIN; this optimization targets pipeline construction. In our real case, it reduced the pipeline build time from 32 seconds to 16 seconds. The example SQL here is just for illustration.

@antaljanosbenjamin
Member

Sorry, this PR went a bit low on my list.

Such queries are also faster when not using EXPLAIN; this optimization targets pipeline construction. In our real case, it reduced the pipeline build time from 32 seconds to 16 seconds. The example SQL here is just for illustration.

I will try to add a performance test, because we should prove that this makes something faster and ensure we retain that performance in the future. Without a performance test, this feels like a bugfix without a test verifying it works.

@clickhouse-gh
Contributor

clickhouse-gh bot commented Dec 10, 2024

Dear @antaljanosbenjamin, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.

@Algunenano
Member

Continuing in #83381 since pushing to the fork is not enabled

@Algunenano Algunenano closed this Jul 7, 2025
github-merge-queue bot pushed a commit that referenced this pull request Jul 10, 2025

Labels

can be tested (Allows running workflows for external contributors), pr-performance (Pull request with some performance improvements)


7 participants