
size-separated String serialization for MergeTree #82850

Merged
Avogar merged 41 commits into ClickHouse:master from amosbird:string-with-size-stream on Oct 3, 2025

@amosbird
Collaborator

@amosbird amosbird commented Jun 29, 2025

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Add optional .size subcolumn serialization for top-level String columns in MergeTree tables to improve compression and enable efficient subcolumn access. Introduce new MergeTree settings for serialization version control and expression optimization for empty strings.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)
  • New MergeTree settings:

    • serialization_info_version – Controls serialization info format when writing serialization.json. Required for cluster upgrades.

      • DEFAULT – Legacy format, compatible with old servers during rolling upgrades.
      • WITH_TYPES – New format with types_serialization_versions, enabling per-type serialization settings like string_serialization_version. Switch to this after upgrades.
    • string_serialization_version – Controls top-level String column serialization (effective only when serialization_info_version = WITH_TYPES).

      • DEFAULT – Standard inline size format.
      • WITH_SIZE_STREAM – Serialize top-level String columns with separate .size stream for better compression. Backward incompatible.
  • Subcolumn support:

    • Access .size subcolumn across both legacy and new String formats, supporting mixed-format queries.
  • Expression optimizations:

    • optimize_empty_string_comparisons rewrites comparisons against the empty string, such as str = '', into isEmpty(str)/isNotEmpty(str).
    • FunctionToSubcolumnsPass extended to rewrite length(str) as str.size.
  • Sparse encoding enhancements:

    • Support for Sparse encoding with multiple substreams for future extension.
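
To make the difference between the two String layouts concrete, here is a minimal Python sketch. It is not ClickHouse code: the function names are hypothetical, and storing sizes as fixed-width little-endian UInt64 values is an assumption of this sketch (the description above does not specify the on-disk encoding of the .size stream). The inline layout is modeled as a varint length followed by the string bytes.

```python
import struct
from typing import List, Tuple


def write_varuint(n: int) -> bytes:
    """Encode a non-negative integer as a LEB128-style varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_inline(values: List[bytes]) -> bytes:
    """Inline layout: each string is a varint length followed by its bytes."""
    return b"".join(write_varuint(len(v)) + v for v in values)


def encode_with_size_stream(values: List[bytes]) -> Tuple[bytes, bytes]:
    """Size-separated layout: one stream of sizes, one stream of raw bytes.

    Fixed-width UInt64 sizes are an assumption made for this sketch only.
    """
    sizes = struct.pack(f"<{len(values)}Q", *[len(v) for v in values])
    data = b"".join(values)
    return sizes, data


def read_sizes(size_stream: bytes) -> List[int]:
    """Answer a length-only query from the size stream alone."""
    count = len(size_stream) // 8
    return list(struct.unpack(f"<{count}Q", size_stream))


def decode_with_size_stream(size_stream: bytes, data: bytes) -> List[bytes]:
    """Rebuild the strings by slicing the data stream at running offsets."""
    out, offset = [], 0
    for size in read_sizes(size_stream):
        out.append(data[offset:offset + size])
        offset += size
    return out
```

In this model, `read_sizes` never touches the data stream, which is the property that makes a separate .size subcolumn cheap to read. Keeping the sizes together in one homogeneous stream is also the kind of arrangement that tends to compress better than sizes interleaved with arbitrary string bytes.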

@clickhouse-gh
Contributor

clickhouse-gh bot commented Jun 29, 2025

Workflow [PR], commit [01e4bfe]

Summary:

| job_name | test_name | status | info | comment |
|---|---|---|---|---|
| Integration tests (arm_binary, distributed plan, 1/4) | | failure | | |
| | test_trace_log_memory_context/test.py::test_memory_context_in_trace_log | FAIL | | |
| Integration tests (arm_binary, distributed plan, 2/4) | | failure | | |
| | test_async_insert_adaptive_busy_timeout/test.py::test_with_replicated_merge_tree_multithread | FAIL | | |
| AST fuzzer (amd_ubsan) | | failure | | |
| | Let op! | FAIL | | |

@clickhouse-gh clickhouse-gh bot added the pr-improvement (Pull request with some product improvements) and submodule changed (At least one submodule changed in this PR.) labels Jun 29, 2025
@amosbird amosbird force-pushed the string-with-size-stream branch from 846deef to 140644b Compare June 29, 2025 16:57
@amosbird amosbird removed the submodule changed (At least one submodule changed in this PR.) label Jun 29, 2025
@Avogar Avogar self-assigned this Jul 1, 2025
@amosbird amosbird mentioned this pull request Aug 5, 2025
@EmeraldShift
Contributor

How wasteful would it be to store each string's size twice, in the original column and in .size? I wonder if this option would make it possible to pick the fastest performance of the current String type, or just .size, at the cost of more disk usage?

@amosbird amosbird force-pushed the string-with-size-stream branch from 140644b to ce0f1e0 Compare August 26, 2025 08:26
@amosbird
Collaborator Author

> How wasteful would it be to store each string's size twice, in the original column and in .size? I wonder if this option would make it possible to pick the fastest performance of the current String type, or just .size, at the cost of more disk usage?

It's probably not a good design direction. In that case we should manually store a string length column instead.

@amosbird amosbird force-pushed the string-with-size-stream branch 5 times, most recently from 1751ff5 to 0323eb9 Compare September 2, 2025 05:08
@amosbird amosbird force-pushed the string-with-size-stream branch 3 times, most recently from 229b90a to 69dadc7 Compare September 3, 2025 09:59
@amosbird amosbird marked this pull request as ready for review September 3, 2025 10:01
@amosbird amosbird force-pushed the string-with-size-stream branch 7 times, most recently from 0069811 to 15c215d Compare September 7, 2025 14:33
@alexey-milovidov alexey-milovidov added the pr-performance (Pull request with some performance improvements) label Sep 7, 2025
@amosbird amosbird force-pushed the string-with-size-stream branch from 15c215d to 2166229 Compare September 8, 2025 01:33
@amosbird
Collaborator Author

test_storage_delta/test.py::test_replicated_database_and_unavailable_s3[1]

#86145

@amosbird amosbird requested a review from Avogar September 25, 2025 09:09
@amosbird
Collaborator Author

Stateless tests (amd_binary, ParallelReplicas, s3 storage, parallel)

#87653

Member

@Avogar Avogar left a comment


Also, let's clean up the PR description and PR name. Let's make the changelog entry short and write a more detailed description in the Documentation entry.

@amosbird amosbird changed the title from ".size-separated String serialization for MergeTree tables" to "size-separated String serialization for MergeTree" Sep 26, 2025
@amosbird
Collaborator Author

AST fuzzer (amd_tsan)

#85404

@amosbird
Collaborator Author

amosbird commented Sep 28, 2025

This PR makes length(str)-only queries faster even when using the legacy serialization format (inlined-size).
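
The comment does not say how the legacy format gets faster. One plausible reading, offered here only as an assumption, is that once length(str) is served through the .size subcolumn, the reader can decode just the length prefixes of the inline format and skip over the payload bytes instead of building string values. The Python sketch below illustrates that idea with hypothetical helper names; it models the inline layout as a varint length followed by the string bytes and is not taken from the ClickHouse sources.

```python
from typing import List, Tuple


def read_varuint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode a LEB128-style varint at pos; return (value, next position)."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: this was the last byte
            return value, pos
        shift += 7


def sizes_from_inline(buf: bytes) -> List[int]:
    """Collect string lengths from an inline-size stream.

    Only the length prefixes are decoded; each payload is skipped by
    advancing the cursor, so no string is ever materialized.
    """
    sizes, pos = [], 0
    while pos < len(buf):
        size, pos = read_varuint(buf, pos)
        sizes.append(size)
        pos += size  # jump over the payload bytes
    return sizes
```

Under this reading the stream still has to be read and decompressed in full, so the saving would come from avoiding string materialization rather than from reading fewer bytes; the size-separated format removes the payload read as well.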

[Three screenshots attached: copied-2025-09-28-08_52_24_400, copied-2025-09-28-08_51_57_497, copied-2025-09-28-08_51_43_814]

Member

@Avogar Avogar left a comment


Amazing work! Just 2 final small comments and it's ready to be merged.

amosbird and others added 2 commits September 29, 2025 15:38
Co-authored-by: Pavel Kruglov <48961922+Avogar@users.noreply.github.com>
@amosbird
Collaborator Author

Integration tests (arm_binary, distributed plan, 3/4) test_keeper_memory_soft_limit/test.py::test_soft_limit_create

#87787

@Avogar
Member

Avogar commented Oct 3, 2025

@Avogar Avogar added this pull request to the merge queue Oct 3, 2025
Merged via the queue into ClickHouse:master with commit 59d18f4 Oct 3, 2025
119 of 123 checks passed
@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-synced-to-cloud (The PR is synced to the cloud repo) label Oct 3, 2025

Labels

pr-performance (Pull request with some performance improvements)
pr-synced-to-cloud (The PR is synced to the cloud repo)


6 participants