Skip to content

Remove zero byte#85063

Merged
Avogar merged 173 commits intomasterfrom
remove-zero-byte
Aug 21, 2025
Merged

Remove zero byte#85063
Avogar merged 173 commits intomasterfrom
remove-zero-byte

Conversation

@alexey-milovidov
Copy link
Copy Markdown
Member

@alexey-milovidov alexey-milovidov commented Aug 4, 2025

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Remove zero byte. Closes #85062. A few minor bugs were fixed. Functions structureToProtobufSchema, structureToCapnProtoSchema didn't correctly put a zero-terminating byte and were using a newline instead of it. That was leading to a missing newline in the output, and could lead to buffer overflows while using other functions that depend on the zero byte (such as logTrace, demangle, extractURLParameter, toStringCutToZero, and encrypt/decrypt). The regexp_tree dictionary layout didn't support processing strings with zero bytes. The formatRowNoNewline function, called with Values format or with any other format without a newline at the end of rows, erroneously cuts the last character of the output. Function stem contained an exception-safety error that could lead to a memory leak in a very rare scenario. The initcap function worked in the wrong way for FixedString arguments: it didn't recognize the start of the word at the start of the string if the previous string in a block ended with a word character. Fixed a security vulnerability of the Apache ORC format, which could lead to the exposure of uninitialized memory. Changed behavior of the function replaceRegexpAll and the corresponding alias, REGEXP_REPLACE: now it can do an empty match at the end of the string even if the previous match processed the whole string, such as in the case of ^a*|a*$ or ^|.* - this corresponds to the semantic of JavaScript, Perl, Python, PHP, Ruby, but differs to the semantic of PostgreSQL. Implementation of many functions has been simplified and optimized. Documentation for several functions was wrong and has now been fixed. Keep in mind that the output of byteSize for String columns and complex types, which consisted of String columns, has changed (from 9 bytes per empty string to 8 bytes per empty string), and this is normal.

@nikitamikhaylov
Copy link
Copy Markdown
Member

FYI @al13n321, it can remove additional memcpy while reading Parquet files.

@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Aug 4, 2025

Workflow [PR], commit [4fa0665]

Summary:

job_name test_name status info comment
Stateless tests (amd_binary, old analyzer, s3 storage, DatabaseReplicated, parallel) failure
02428_decimal_in_floating_point_literal FAIL
03315_join_temporary_table_names FAIL
02504_regexp_dictionary_table_source FAIL
00392_enum_nested_alter FAIL
00411_long_accurate_number_comparison_int2 FAIL
Exception in test runner FAIL
Killed by signal (in clickhouse-server.log or clickhouse-server.err.log) FAIL
Fatal messages (in clickhouse-server.log or clickhouse-server.err.log) FAIL
Stateless tests (amd_debug, distributed plan, s3 storage, parallel) failure
01086_odbc_roundtrip FAIL
Performance Comparison (arm_release, master_head, 3/3) failure
Check Results failure

@clickhouse-gh clickhouse-gh bot added the pr-backward-incompatible Pull request with backwards incompatible changes label Aug 4, 2025
@amosbird
Copy link
Copy Markdown
Collaborator

amosbird commented Aug 5, 2025

It would be perfect to land this before #82850, so that the new serialization layout can be better aligned and optimized.

@alexey-milovidov
Copy link
Copy Markdown
Member Author

alexey-milovidov commented Aug 16, 2025

I've sped up hashed_dictionary by adding alignment.
Upd: but it's controversial - reverted.

Copy link
Copy Markdown
Member

@Avogar Avogar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Avogar
Copy link
Copy Markdown
Member

Avogar commented Aug 21, 2025

@Avogar Avogar added this pull request to the merge queue Aug 21, 2025
Merged via the queue into master with commit d25f875 Aug 21, 2025
119 of 122 checks passed
@Avogar Avogar deleted the remove-zero-byte branch August 21, 2025 12:03
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-synced-to-cloud The PR is synced to the cloud repo label Aug 21, 2025
Avogar added a commit to Avogar/ClickHouse that referenced this pull request Aug 26, 2025
github-merge-queue bot pushed a commit that referenced this pull request Aug 26, 2025
Fix use-of-unitialized-value and crash introduced in #85063
robot-ch-test-poll1 added a commit that referenced this pull request Aug 26, 2025
Cherry pick #86171 to 25.8: Fix use-of-unitialized-value and crash introduced in #85063
clickhouse-gh bot added a commit that referenced this pull request Aug 26, 2025
Backport #86171 to 25.8: Fix use-of-unitialized-value and crash introduced in #85063
@vitlibar vitlibar mentioned this pull request Mar 16, 2026
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-performance Pull request with some performance improvements pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Remove extraneous terminating zero byte from ColumnString

7 participants