Skip to content

Fix flaky test_disallow_concurrency#93612

Merged
pamarcos merged 15 commits intoClickHouse:masterfrom
pamarcos:fix-test_disallow_concurrency
Jan 12, 2026
Merged

Fix flaky test_disallow_concurrency#93612
pamarcos merged 15 commits intoClickHouse:masterfrom
pamarcos:fix-test_disallow_concurrency

Conversation

@pamarcos
Copy link
Copy Markdown
Member

@pamarcos pamarcos commented Jan 7, 2026

Fix LOGICAL_ERROR when restoring ReplicatedMergeTree with deduplication race

When restoring a ReplicatedMergeTree table ON CLUSTER, multiple replicas
restore the same parts concurrently. If one replica commits a part to
ZooKeeper first, the other replica's part gets deduplicated. The code
assumed deduplicated parts always came from ATTACH PART (located in
detached/attaching_*) and threw LOGICAL_ERROR for other paths, causing
the server to abort.

During restore, parts are in tmp_restore_* directories instead. Those parts
are now properly deduplicated as well.

Log traces can be found in the logs of any of the failures:

Logical error: 'Unexpected relative path for a deduplicated part: 
store/813/81355abf-88f2-4516-9658-3396821e02f9/tmp_restore_all_36_41_1-781a6406-308b-4a6f-b253-aa6ea64efcb8/'.

In 89da254 I changed the flaky check to run 1000 times to ensure the tests are not flaky anymore. Considering from prior failures) a 2-8% failure rate, we should see at least some failures if the flakiness was still there. Fortunately, all tests succeeded after 5h of job running:

❯ grep "%] PASS" job.log | wc -l
2140

❯ grep "%] FAIL" job.log | wc -l
0

❯ grep "%] ERR" job.log | wc -l
0

Closes #93413
Closes #93235
Closes #93177
Closes #68012
Closes #93549
Closes #93550

Changelog category (leave one):

  • Critical Bug Fix (crash, data loss, RBAC) or LOGICAL_ERROR

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix LOGICAL_ERROR when restoring ReplicatedMergeTree with deduplication race

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

…on race

When restoring a ReplicatedMergeTree table ON CLUSTER, multiple replicas
restore the same parts concurrently. If one replica commits a part to
ZooKeeper first, the other replica's part gets deduplicated. The code
assumed deduplicated parts always came from ATTACH PART (located in
detached/attaching_*) and threw LOGICAL_ERROR for other paths, causing
the server to abort.

During restore, parts are in tmp_restore_* directories instead. Now
deduplicated parts from restore are simply removed since the data already
exists on another replica and will be fetched via normal replication.
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Jan 7, 2026

Workflow [PR], commit [d2b3915]

Summary:

job_name test_name status info comment
Integration tests (amd_asan, db disk, old analyzer, 6/6) failure
test_max_temporary_data_size_on_disk/test.py::test FAIL cidb, issue ISSUE CREATED
Integration tests (amd_asan, targeted) error IGNORED

@clickhouse-gh clickhouse-gh bot added pr-critical-bugfix pr-must-backport Pull request should be backported intentionally. Use this label with great care! labels Jan 7, 2026
@pamarcos pamarcos requested a review from Copilot January 7, 2026 17:30
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a LOGICAL_ERROR crash that occurred when restoring ReplicatedMergeTree tables ON CLUSTER due to a deduplication race condition between replicas.

Key Changes:

  • Fixed handling of deduplicated parts during concurrent restore operations by removing them instead of throwing LOGICAL_ERROR
  • Removed unused import from test file
  • Increased flaky test repeat counts to better catch concurrency issues

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
src/Storages/MergeTree/ReplicatedMergeTreeSink.cpp Fixed deduplication logic to handle parts from RESTORE operations by removing them instead of throwing LOGICAL_ERROR
tests/integration/test_backup_restore_on_cluster/test_disallow_concurrency.py Removed unused typing.List import
ci/jobs/integration_test_job.py Increased flaky check repeat counts for better test coverage

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

@pamarcos pamarcos requested a review from Copilot January 8, 2026 10:39
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

@pamarcos pamarcos requested a review from Copilot January 8, 2026 12:29
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

const auto & relative_path = part->getDataPartStorage().getRelativePath();
const auto part_dir = fs::path(relative_path).parent_path().filename().string();

if (relative_path.ends_with("detached/attaching_" + part->name + "/"))
Copy link

Copilot AI Jan 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original logic checked that the path did NOT match the expected pattern and threw an error. The new logic inverts this to a positive check. However, this creates inconsistency: the first condition uses ends_with on relative_path, while the second condition uses starts_with on part_dir (which is just the parent directory name). Consider using a consistent approach for both checks, either both operating on relative_path or both on part_dir, to improve code clarity and reduce the chance of logic errors.

Copilot uses AI. Check for mistakes.
@vitlibar
Copy link
Copy Markdown
Member

vitlibar commented Jan 8, 2026

When restoring a ReplicatedMergeTree table ON CLUSTER, multiple replicas restore the same parts concurrently.

It shouldn't be like this. There is a condition (see) which checks that only one replica inserts restored parts into each replicated table.

robot-clickhouse added a commit that referenced this pull request Jan 12, 2026
Cherry pick #93612 to 25.12: Fix flaky test_disallow_concurrency
@robot-ch-test-poll robot-ch-test-poll added the pr-synced-to-cloud The PR is synced to the cloud repo label Jan 12, 2026
pamarcos added a commit that referenced this pull request Jan 12, 2026
Cherry pick #93612 to 25.3: Fix flaky test_disallow_concurrency
pamarcos added a commit that referenced this pull request Jan 12, 2026
Cherry pick #93612 to 25.8: Fix flaky test_disallow_concurrency
pamarcos added a commit that referenced this pull request Jan 12, 2026
Cherry pick #93612 to 25.10: Fix flaky test_disallow_concurrency
pamarcos added a commit that referenced this pull request Jan 12, 2026
Cherry pick #93612 to 25.11: Fix flaky test_disallow_concurrency
clickhouse-gh bot added a commit that referenced this pull request Jan 12, 2026
Backport #93612 to 25.12: Fix flaky test_disallow_concurrency
@robot-clickhouse robot-clickhouse added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label Jan 12, 2026
pamarcos added a commit that referenced this pull request Jan 12, 2026
Backport #93612 to 25.3: Fix flaky test_disallow_concurrency
clickhouse-gh bot added a commit that referenced this pull request Jan 12, 2026
Backport #93612 to 25.11: Fix flaky test_disallow_concurrency
pamarcos added a commit that referenced this pull request Jan 13, 2026
Backport #93612 to 25.8: Fix flaky test_disallow_concurrency
pamarcos added a commit that referenced this pull request Jan 13, 2026
Backport #93612 to 25.10: Fix flaky test_disallow_concurrency
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment