CI: enable failpoint injection in stress tests by zlareb1 · Pull Request #100682 · ClickHouse/ClickHouse

zlareb1 · 2026-03-25T08:44:22Z

Summary

- Add tests/config/config.d/fail_points_active.xml with 5 REGULAR failpoints to activate during the stress test phase
- install.sh: conditionally links the config when CLICKHOUSE_FAILPOINTS_INJECTION=1
- stress_runner.sh: exports CLICKHOUSE_FAILPOINTS_INJECTION=1 before the stress phase; removes the config before the clean-restart check
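
The gating described in the list above could look roughly like the following sketch. The helper names `link_failpoint_config` and `remove_failpoint_config` and the directory layout are assumptions made for illustration; the actual code in install.sh and stress_runner.sh is not shown in this PR description and may differ.

```shell
#!/usr/bin/env bash
set -euo pipefail

CONFIG_NAME="fail_points_active.xml"

# Hypothetical helper (install.sh side): link the failpoint config
# only when injection has been requested through the environment.
link_failpoint_config() {
    local src_dir="$1" dest_dir="$2"
    if [[ "${CLICKHOUSE_FAILPOINTS_INJECTION:-0}" == "1" ]]; then
        ln -sf "$src_dir/config.d/$CONFIG_NAME" "$dest_dir/config.d/"
    fi
}

# Hypothetical helper (stress_runner.sh side): drop the config so the
# clean-restart check starts the server without any failpoints enabled.
remove_failpoint_config() {
    local dest_dir="$1"
    rm -f "$dest_dir/config.d/$CONFIG_NAME"
}

# Demo in a throwaway directory tree.
work_dir="$(mktemp -d)"
mkdir -p "$work_dir/src/config.d" "$work_dir/dest/config.d"
touch "$work_dir/src/config.d/$CONFIG_NAME"

export CLICKHOUSE_FAILPOINTS_INJECTION=1   # set before the stress phase
link_failpoint_config "$work_dir/src" "$work_dir/dest"
ls "$work_dir/dest/config.d"               # lists the linked config

remove_failpoint_config "$work_dir/dest"   # before the clean-restart check
ls "$work_dir/dest/config.d"               # now empty
rm -rf "$work_dir"
```

Defaulting the variable with `${CLICKHOUSE_FAILPOINTS_INJECTION:-0}` keeps the helper safe under `set -u` and leaves other jobs that reuse the install step unaffected unless they opt in.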

Only REGULAR failpoints are included — they fire-and-return on every trigger with no synchronization, making them safe for long-running stress runs. ONCE failpoints would fire once then auto-disable; PAUSEABLE failpoints would deadlock the server since nothing calls SYSTEM NOTIFY FAILPOINT to resume them.

Failpoints enabled:

| Failpoint | Purpose |
| --- | --- |
| `replicated_merge_tree_commit_zk_fail_when_recovering_from_hw_fault` | Exercises ZK commit failure recovery path |
| `use_delayed_remote_source` | Adds latency to remote source reads |
| `cluster_discovery_faults` | Injects faults in cluster discovery |
| `check_table_query_delay_for_part` | Delays part-level table checks |
| `remove_merge_tree_part_delay` | Delays MergeTree part removal |

Test plan

- Stress test run passes with CLICKHOUSE_FAILPOINTS_INJECTION=1
- Server starts cleanly after stress phase (config removed before clean-restart check)
- No regressions in existing stress test checks

🤖 Generated with Claude Code

Changelog category (leave one):

CI Fix or Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

Activate a set of REGULAR failpoints during the stress test phase to increase code path coverage. Only REGULAR failpoints are used: they fire-and-return on every trigger with no synchronization, making them safe for long-running stress runs (ONCE would fire once then vanish; PAUSEABLE would deadlock with nothing to resume them).

Changes:
- Add tests/config/config.d/fail_points_active.xml with 5 REGULAR failpoints: replicated_merge_tree_commit_zk_fail_when_recovering_from_hw_fault, use_delayed_remote_source, cluster_discovery_faults, check_table_query_delay_for_part, remove_merge_tree_part_delay
- install.sh: link the config when CLICKHOUSE_FAILPOINTS_INJECTION=1
- stress_runner.sh: export CLICKHOUSE_FAILPOINTS_INJECTION=1 for the stress phase; remove the config before the clean-restart check

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

clickhouse-gh · 2026-03-25T08:45:02Z

Workflow [PR], commit [5a593c9]

Summary: ✅

AI Review

Summary

This PR enables controlled failpoint injection during the stress phase by adding fail_points_active.xml, wiring it via install.sh behind CLICKHOUSE_FAILPOINTS_INJECTION=1, and removing the config before the clean-restart validation. I did not find correctness, safety, performance, or rollout issues in the changed code paths, and the PR metadata is consistent with a CI-only change.

ClickHouse Rules

| Item | Status | Notes |
| --- | --- | --- |
| Deletion logging | ➖ | |
| Serialization versioning | ➖ | |
| Core-area scrutiny | ✅ | |
| No test removal | ✅ | |
| Experimental gate | ➖ | |
| No magic constants | ✅ | |
| Backward compatibility | ✅ | |
| `SettingsChangesHistory.cpp` | ➖ | |
| PR metadata quality | ✅ | |
| Safe rollout | ✅ | |
| Compilation time | ✅ | |

Final Verdict

Status: ✅ Approve

The remove_merge_tree_part_delay failpoint sleeps 1300-1500 ms on every MergeTree part removal. During DETACH DATABASE with database_atomic_wait_for_drop_and_detach_synchronously=1, the server must remove all parts before the query completes. With many parts accumulated during the stress run, the cumulative delay caused the hung check to trigger (391 s).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
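
As a rough sanity check on the numbers in this commit message, the sketch below estimates how many parts would account for the 391 s figure. It assumes that parts are removed strictly one after another and that the failpoint sleep dominates the cost; neither assumption is confirmed by the source, and the variable names are illustrative only.

```shell
#!/usr/bin/env bash
set -euo pipefail

observed_ms=$((391 * 1000))   # the 391 s figure quoted in the commit message
min_delay_ms=1300             # lower bound of the failpoint sleep
max_delay_ms=1500             # upper bound of the failpoint sleep

# Integer division: number of whole sleeps that fit into the observed time.
parts_at_max_delay=$((observed_ms / max_delay_ms))
parts_at_min_delay=$((observed_ms / min_delay_ms))

echo "parts at ${max_delay_ms} ms each: ${parts_at_max_delay}"
echo "parts at ${min_delay_ms} ms each: ${parts_at_min_delay}"
```

Under those assumptions, somewhere between about 260 and 300 parts would be enough to reach 391 s, which is a small number for a database that has been through a stress run.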

alexey-milovidov

Thanks, LGTM!

ReadFromRemote::addLazyPipe() was missing the cluster_for_parallel_replicas override that addPipe() has. When the use_delayed_remote_source failpoint forces the lazy path, the _shard_num scalar from the distributed execution leaks into prepareClusterForParallelReplicas with a value exceeding the parallel replicas cluster's shard count, causing a LOGICAL_ERROR crash.

This was triggered by PR ClickHouse#100682 enabling the use_delayed_remote_source failpoint in stress tests, causing 14+ crashes across unrelated PRs.

The fix adds the same cluster_for_parallel_replicas override to addLazyPipe() that already exists in addPipe(), ensuring the parallel replicas cluster matches the distributed table's cluster in both code paths.

Closes ClickHouse#81738

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ReadFromRemote::addLazyPipe() was missing the cluster_for_parallel_replicas override that addPipe() has. When the use_delayed_remote_source failpoint forces the lazy path, the _shard_num scalar from the distributed execution leaks into prepareClusterForParallelReplicas with a value exceeding the parallel replicas cluster's shard count, causing a LOGICAL_ERROR crash.

This was triggered by PR ClickHouse#100682 enabling the use_delayed_remote_source failpoint in stress tests, causing 30+ crashes across unrelated PRs.

The fix adds the same cluster_for_parallel_replicas override to addLazyPipe() that already exists in addPipe(), ensuring the parallel replicas cluster matches the distributed table's cluster in both code paths.

The regression test uses the no-parallel tag because the use_delayed_remote_source failpoint is global; concurrent tests could trigger "Unexpected lazy remote read from a non-replicated table" crashes (same pattern as test 02863).

Closes ClickHouse#81738

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

zlareb1 marked this pull request as draft March 25, 2026 08:45

CI: retrigger

fd5b1ce

clickhouse-gh bot added the pr-ci label Mar 25, 2026

zlareb1 marked this pull request as ready for review March 25, 2026 19:59

alexey-milovidov approved these changes Mar 29, 2026

alexey-milovidov self-assigned this Mar 29, 2026

alexey-milovidov merged commit d3db2b6 into ClickHouse:master Mar 29, 2026
152 of 153 checks passed

robot-ch-test-poll added the pr-synced-to-cloud label (The PR is synced to the cloud repo) Mar 29, 2026

groeneai mentioned this pull request Mar 30, 2026

Fix parallel replicas crash with lazy remote source #101154

Open

Desel72 pushed a commit to Desel72/ClickHouse that referenced this pull request Mar 30, 2026

Merge pull request ClickHouse#100682 from zlareb1/failpoints_in_stres…

c91cbdd

…s_tests CI: enable failpoint injection in stress tests

alexey-milovidov mentioned this pull request Mar 31, 2026

Revert "CI: enable failpoint injection in stress tests" #101430

Merged

alexey-milovidov merged 3 commits into ClickHouse:master from zlareb1:failpoints_in_stress_tests
