
Minimize memory copy in port headers during pipeline construction #70105

Closed

heymind wants to merge 2 commits into ClickHouse:master from heymind:fix/reduce_port_header_memcpy

Conversation

@heymind
Contributor

@heymind heymind commented Sep 29, 2024

This PR introduces an optimization that reduces memory copying of Port headers during construction of the pipeline graph. The core change is switching the header type in Port.h from an owning type to a const shared_ptr<const Block>, which significantly reduces memory copying when building expressions on tables with many wide columns.

  1. The member header in Port.h is now a const shared_ptr<const Block>, which allows shared ownership and immutability. Profiling showed that Port instances were frequently cloned; sharing the header avoids those copies and improves performance.

  2. IProcessor instances contain both input_ports and output_ports. When chaining processors using AddSimpleTransform, the output ports of the previous processor share the same header as the input ports of the next processor. This PR ensures that these headers can share the same reference, reducing the number of unnecessary cloning operations by N * num_streams.

  3. The use of const shared_ptr<const Block> ensures stricter immutability compared to the previous implementation, providing additional safety guarantees enforced by the compiler.

  4. Transitioning all occurrences to the new port header style is a substantial task. To facilitate this, the PR introduces AddSimpleTransform method overloads in Pipe and QueryPipelineBuilder, enabling incremental adoption. An example can be seen in ExpressionStep::transformPipeline, which reduces header copying by a factor of num_streams.

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Minimize memory copy in port headers during pipeline construction

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

CI Settings (Only check the boxes if you know what you are doing):

  • Allow: All Required Checks
  • Allow: Stateless tests
  • Allow: Stateful tests
  • Allow: Integration Tests
  • Allow: Performance tests
  • Allow: All Builds
  • Allow: batch 1, 2 for multi-batch jobs
  • Allow: batch 3, 4, 5, 6 for multi-batch jobs

  • Exclude: Style check
  • Exclude: Fast test
  • Exclude: All with ASAN
  • Exclude: All with TSAN, MSAN, UBSAN, Coverage
  • Exclude: All with aarch64, release, debug

  • Run only fuzzers related jobs (libFuzzer fuzzers, AST fuzzers, etc.)
  • Exclude: AST fuzzers

  • Do not test
  • Woolen Wolfdog
  • Upload binaries for special builds
  • Disable merge-commit
  • Disable CI cache

@CLAassistant

CLAassistant commented Sep 29, 2024

CLA assistant check
All committers have signed the CLA.

@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label Sep 29, 2024
@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added the pr-performance Pull request with some performance improvements label Sep 29, 2024
@robot-clickhouse-ci-1
Contributor

robot-clickhouse-ci-1 commented Sep 29, 2024

This is an automated comment for commit a82b9fd with a description of existing statuses. It is updated for the latest CI run.


Successful checks
Check name | Description | Status
AST fuzzer | Runs randomly generated queries to catch program errors. The build type is optionally given in parentheses. If it fails, ask a maintainer for help | ✅ success
Builds | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success
ClickBench | Runs [ClickBench](https://github.com/ClickHouse/ClickBench/) with an instant-attach table | ✅ success
Compatibility check | Checks that the clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help | ✅ success
Docker keeper image | Builds and optionally pushes the mentioned image to Docker Hub | ✅ success
Docker server image | Builds and optionally pushes the mentioned image to Docker Hub | ✅ success
Docs check | Builds and tests the documentation | ✅ success
Fast test | Normally the first check run for a PR. It builds ClickHouse and runs most of the stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here | ✅ success
Flaky tests | Checks whether newly added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer and additional randomization of thread scheduling. Integration tests are run up to 10 times. If a new test fails at least once, or runs too long, this check is red. We don't allow flaky tests, read the doc | ✅ success
Install packages | Checks that the built packages are installable in a clean environment | ✅ success
Integration tests | The integration tests report. The package type is given in parentheses, and the optional part/total tests in square brackets | ✅ success
Performance Comparison | Measures changes in query performance. The performance test report is described in detail here. The optional part/total tests are in square brackets | ✅ success
Stateful tests | Runs stateful functional tests for ClickHouse binaries built in various configurations: release, debug, with sanitizers, etc. | ✅ success
Stateless tests | Runs stateless functional tests for ClickHouse binaries built in various configurations: release, debug, with sanitizers, etc. | ✅ success
Stress test | Runs stateless functional tests concurrently from several clients to detect concurrency-related errors | ✅ success
Style check | Runs a set of checks to keep the code style clean. If some tests fail, see the related log in the report | ✅ success
Unit tests | Runs the unit tests for different release types | ✅ success
Upgrade check | Runs stress tests on the server version from the last release, then tries to upgrade it to the version from the PR. It checks whether the new server can start up without errors, crashes, or sanitizer asserts | ✅ success

@antaljanosbenjamin antaljanosbenjamin self-assigned this Sep 30, 2024
Member

@antaljanosbenjamin antaljanosbenjamin left a comment


Really nice! I have a few small comments, nothing substantial.


@antaljanosbenjamin
Member

A very similar improvement is visible in another PR, so I am not sure whether this PR actually helps. Let me figure this out.

@heymind heymind force-pushed the fix/reduce_port_header_memcpy branch 2 times, most recently from af4a227 to c88d50a on October 8, 2024 07:42
Co-authored-by: János Benjamin Antal <antaljanosbenjamin@users.noreply.github.com>
@heymind heymind force-pushed the fix/reduce_port_header_memcpy branch from c88d50a to a82b9fd on October 8, 2024 08:02
Member

@antaljanosbenjamin antaljanosbenjamin left a comment


The changes look good; however, I don't see a meaningful speedup in the perf tests. Without an improvement, I am really not sure whether we should merge this. Do you have a query that can show the performance difference?

@heymind
Contributor Author

heymind commented Oct 10, 2024

The changes look good; however, I don't see a meaningful speedup in the perf tests. Without an improvement, I am really not sure whether we should merge this. Do you have a query that can show the performance difference?

This performance issue was identified through profiling, where I noticed numerous memcpy operations in the Port constructor. I will devise a simple SQL query to demonstrate the difference.

@heymind
Contributor Author

heymind commented Oct 10, 2024

I can't share the real query, but here is a similar one that shows the effect.

-- First, create the table
CREATE TABLE my_table
(
    `id` String,
    `long_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_long_column_name_1` Nullable(String),
    `long_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_long_column_name_2` Nullable(String),
    `long_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_long_column_name_3` Nullable(String),
    `long_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_long_column_name_4` Nullable(String),
    `long_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_long_column_name_5` Nullable(String),
    `long_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_long_column_name_6` Nullable(String),
    `long_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_long_column_name_7` Nullable(String),
    `long_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_long_column_name_8` Nullable(String),
    `long_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_long_column_name_9` Nullable(String),
    `long_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_long_column_name_10` Nullable(String),
    `long_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_long_column_name_11` Nullable(String),
    `long_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_long_column_name_12` Nullable(String),
    `long_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_long_column_name_13` Nullable(String),
    `long_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_long_column_name_14` Nullable(String),
    `long_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_longlong_long_long_long_long_column_name_15` Nullable(String)
)
ENGINE = MergeTree
ORDER BY id;

-- Generate some random data
INSERT INTO my_table SELECT
    toString(number),
    toString(number),
    toString(number),
    toString(number),
    toString(number),
    toString(number),
    toString(number),
    toString(number),
    toString(number),
    toString(number),
    toString(number),
    toString(number),
    toString(number),
    toString(number),
    toString(number),
    toString(number)
FROM numbers(10000000);

-- The query
SELECT 
    *
FROM 
    my_table AS t1 
JOIN 
    my_table AS t2 ON t1.id = t2.id
JOIN 
    my_table AS t3 ON t2.id = t3.id
JOIN 
    my_table AS t4 ON t3.id = t4.id
JOIN 
    my_table AS t5 ON t4.id = t5.id
JOIN 
    my_table AS t6 ON t5.id = t6.id
JOIN 
    my_table AS t7 ON t6.id = t7.id
JOIN 
    my_table AS t8 ON t7.id = t8.id
JOIN 
    my_table AS t9 ON t8.id = t9.id
JOIN 
    my_table AS t10 ON t9.id = t10.id
JOIN 
    my_table AS t11 ON t10.id = t11.id
JOIN 
    my_table AS t12 ON t11.id = t12.id
JOIN 
    my_table AS t13 ON t12.id = t13.id
JOIN 
    my_table AS t14 ON t13.id = t14.id
JOIN 
    my_table AS t15 ON t14.id = t15.id
JOIN 
    my_table AS t16 ON t15.id = t16.id
JOIN 
    my_table AS t17 ON t16.id = t17.id
JOIN 
    my_table AS t18 ON t17.id = t18.id
JOIN 
    my_table AS t19 ON t18.id = t19.id
JOIN 
    my_table AS t20 ON t19.id = t20.id
JOIN 
    my_table AS t21 ON t20.id = t21.id
JOIN 
    my_table AS t22 ON t21.id = t22.id
JOIN 
    my_table AS t23 ON t22.id = t23.id
JOIN 
    my_table AS t24 ON t23.id = t24.id
JOIN 
    my_table AS t25 ON t24.id = t25.id
JOIN 
    my_table AS t26 ON t25.id = t26.id
JOIN 
    my_table AS t27 ON t26.id = t27.id
JOIN 
    my_table AS t28 ON t27.id = t28.id
JOIN 
    my_table AS t29 ON t28.id = t29.id
JOIN 
    my_table AS t30 ON t29.id = t30.id
JOIN 
    my_table AS t31 ON t30.id = t31.id
JOIN 
    my_table AS t32 ON t31.id = t32.id
JOIN 
    my_table AS t33 ON t32.id = t33.id
JOIN 
    my_table AS t34 ON t33.id = t34.id
JOIN 
    my_table AS t35 ON t34.id = t35.id
JOIN 
    my_table AS t36 ON t35.id = t36.id
JOIN 
    my_table AS t37 ON t36.id = t37.id
JOIN 
    my_table AS t38 ON t37.id = t38.id
JOIN 
    my_table AS t39 ON t38.id = t39.id
JOIN 
    my_table AS t40 ON t39.id = t40.id
JOIN 
    my_table AS t41 ON t40.id = t41.id
JOIN 
    my_table AS t42 ON t41.id = t42.id
JOIN 
    my_table AS t43 ON t42.id = t43.id
JOIN 
    my_table AS t44 ON t43.id = t44.id
JOIN 
    my_table AS t45 ON t44.id = t45.id
JOIN 
    my_table AS t46 ON t45.id = t46.id
JOIN 
    my_table AS t47 ON t46.id = t47.id
JOIN 
    my_table AS t48 ON t47.id = t48.id
JOIN 
    my_table AS t49 ON t48.id = t49.id
JOIN 
    my_table AS t50 ON t49.id = t50.id
SETTINGS max_threads = 128;

-- The result set is excessively large, and this optimization has no noticeable impact on execution.
-- Use explain pipeline to see the differences.

Before:

explain pipeline SELECT * ...
...
543 rows in set. Elapsed: 12.498 sec.

After:

explain pipeline SELECT * ...
...
543 rows in set. Elapsed: 7.708 sec.

@antaljanosbenjamin
Member

The result set is excessively large, and this optimization has no noticeable impact on execution.

Then why should we do this? Just to make EXPLAIN queries faster in some edge cases?

@devcrafter
Member

devcrafter commented Oct 14, 2024

I don't see any such slowdown on a cloud instance (32 GB node, 8 vCPUs) 🤔

EXPLAIN PIPELINE
SELECT *
FROM my_table AS t1
INNER JOIN my_table AS t2 ON t1.id = t2.id
...
INNER JOIN my_table AS t50 ON t49.id = t50.id

Query id: f0c27860-ebae-4517-b191-3d56c5ed4ec1


   ┌─explain────────────────────────────┐
1. │ (Expression)                       │
2. │ ExpressionTransform × 3            │
3. │   (ReadFromRemoteParallelReplicas) │
   └────────────────────────────────────┘

3 rows in set. Elapsed: 0.436 sec.

Update: the same holds without parallel replicas:

EXPLAIN PIPELINE
SELECT *
FROM my_table AS t1
INNER JOIN my_table AS t2 ON t1.id = t2.id
...
INNER JOIN my_table AS t50 ON t49.id = t50.id
SETTINGS enable_parallel_replicas = 0
FORMAT `Null`

Query id: 7fa87904-d3f7-4785-8e2e-47aa6b0067bd

Ok.

0 rows in set. Elapsed: 0.958 sec.

@devcrafter
Member

The slowdown depends on the max_threads setting. In the example, a higher value makes the pipeline wider (more ExpressionTransform instances are created):

EXPLAIN PIPELINE
SELECT *
FROM my_table AS t1
INNER JOIN my_table AS t2 ON t1.id = t2.id
...
INNER JOIN my_table AS t50 ON t49.id = t50.id
SETTINGS enable_parallel_replicas = 0, max_threads = 128
FORMAT `Null`

Query id: 889fb2a7-cb6c-48b0-9ee5-adb5d395c532

Ok.

0 rows in set. Elapsed: 7.048 sec.

@heymind
Contributor Author

heymind commented Oct 16, 2024

The result set is excessively large, and this optimization has no noticeable impact on execution.

Then why should we do this? Just to make EXPLAIN queries faster in some edge cases?

Such queries are also faster when not using EXPLAIN; this optimization targets pipeline construction. In our real case, it reduced the pipeline build time from 32 seconds to 16 seconds. The example SQL here is just for illustration.

@antaljanosbenjamin
Member

Sorry, this PR went a bit low on my list.

Such queries are also faster when not using EXPLAIN; this optimization targets pipeline construction. In our real case, it reduced the pipeline build time from 32 seconds to 16 seconds. The example SQL here is just for illustration.

I will try to add a performance test, because we should prove that this makes something faster and ensure we retain that performance in the future. Without a performance test, this feels like a bugfix without a test verifying it works.

@clickhouse-gh
Contributor

clickhouse-gh bot commented Dec 10, 2024

Dear @antaljanosbenjamin, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.

@Algunenano
Member

Continuing in #83381 since pushing to the fork is not enabled

@Algunenano Algunenano closed this Jul 7, 2025
github-merge-queue bot pushed a commit that referenced this pull request Jul 10, 2025

Labels

can be tested (Allows running workflows for external contributors), pr-performance (Pull request with some performance improvements)


7 participants