CI: enable failpoint injection in stress tests#100682
Merged
alexey-milovidov merged 3 commits intoClickHouse:masterfrom Mar 29, 2026
Merged
CI: enable failpoint injection in stress tests#100682alexey-milovidov merged 3 commits intoClickHouse:masterfrom
alexey-milovidov merged 3 commits intoClickHouse:masterfrom
Conversation
Activate a set of REGULAR failpoints during the stress test phase to increase code path coverage. Only REGULAR failpoints are used — they fire-and-return on every trigger with no synchronization, making them safe for long-running stress runs (ONCE would fire once then vanish; PAUSEABLE would deadlock with nothing to resume them). Changes: - Add tests/config/config.d/fail_points_active.xml with 5 REGULAR failpoints: replicated_merge_tree_commit_zk_fail_when_recovering_from_hw_fault, use_delayed_remote_source, cluster_discovery_faults, check_table_query_delay_for_part, remove_merge_tree_part_delay - install.sh: link the config when CLICKHOUSE_FAILPOINTS_INJECTION=1 - stress_runner.sh: export CLICKHOUSE_FAILPOINTS_INJECTION=1 for the stress phase; remove the config before the clean-restart check Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor
|
Workflow [PR], commit [5a593c9] Summary: ✅ AI ReviewSummaryThis PR enables controlled failpoint injection during the stress phase by adding ClickHouse Rules
Final Verdict
|
This failpoint sleeps 1300-1500ms on every MergeTree part removal. During DETACH DATABASE with database_atomic_wait_for_drop_and_detach_synchronously=1, the server must remove all parts before completing. With many parts accumulated during stress, this caused the hung check to trigger (391s). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
groeneai
added a commit
to groeneai/ClickHouse
that referenced
this pull request
Mar 30, 2026
ReadFromRemote::addLazyPipe() was missing the cluster_for_parallel_replicas override that addPipe() has. When the use_delayed_remote_source failpoint forces the lazy path, the _shard_num scalar from the distributed execution leaks into prepareClusterForParallelReplicas with a value exceeding the parallel replicas cluster's shard count, causing a LOGICAL_ERROR crash. This was triggered by PR ClickHouse#100682 enabling the use_delayed_remote_source failpoint in stress tests, causing 14+ crashes across unrelated PRs. The fix adds the same cluster_for_parallel_replicas override to addLazyPipe() that already exists in addPipe(), ensuring the parallel replicas cluster matches the distributed table's cluster in both code paths. Closes ClickHouse#81738 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
groeneai
added a commit
to groeneai/ClickHouse
that referenced
this pull request
Mar 30, 2026
ReadFromRemote::addLazyPipe() was missing the cluster_for_parallel_replicas override that addPipe() has. When the use_delayed_remote_source failpoint forces the lazy path, the _shard_num scalar from the distributed execution leaks into prepareClusterForParallelReplicas with a value exceeding the parallel replicas cluster's shard count, causing a LOGICAL_ERROR crash. This was triggered by PR ClickHouse#100682 enabling the use_delayed_remote_source failpoint in stress tests, causing 14+ crashes across unrelated PRs. The fix adds the same cluster_for_parallel_replicas override to addLazyPipe() that already exists in addPipe(), ensuring the parallel replicas cluster matches the distributed table's cluster in both code paths. Closes ClickHouse#81738 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Desel72
pushed a commit
to Desel72/ClickHouse
that referenced
this pull request
Mar 30, 2026
…s_tests CI: enable failpoint injection in stress tests
groeneai
added a commit
to groeneai/ClickHouse
that referenced
this pull request
Mar 30, 2026
ReadFromRemote::addLazyPipe() was missing the cluster_for_parallel_replicas override that addPipe() has. When the use_delayed_remote_source failpoint forces the lazy path, the _shard_num scalar from the distributed execution leaks into prepareClusterForParallelReplicas with a value exceeding the parallel replicas cluster's shard count, causing a LOGICAL_ERROR crash. This was triggered by PR ClickHouse#100682 enabling the use_delayed_remote_source failpoint in stress tests, causing 14+ crashes across unrelated PRs. The fix adds the same cluster_for_parallel_replicas override to addLazyPipe() that already exists in addPipe(), ensuring the parallel replicas cluster matches the distributed table's cluster in both code paths. Closes ClickHouse#81738 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
groeneai
added a commit
to groeneai/ClickHouse
that referenced
this pull request
Mar 31, 2026
ReadFromRemote::addLazyPipe() was missing the cluster_for_parallel_replicas override that addPipe() has. When the use_delayed_remote_source failpoint forces the lazy path, the _shard_num scalar from the distributed execution leaks into prepareClusterForParallelReplicas with a value exceeding the parallel replicas cluster's shard count, causing a LOGICAL_ERROR crash. This was triggered by PR ClickHouse#100682 enabling the use_delayed_remote_source failpoint in stress tests, causing 30+ crashes across unrelated PRs. The fix adds the same cluster_for_parallel_replicas override to addLazyPipe() that already exists in addPipe(), ensuring the parallel replicas cluster matches the distributed table's cluster in both code paths. The regression test uses no-parallel tag because the use_delayed_remote_source failpoint is global — concurrent tests could trigger "Unexpected lazy remote read from a non-replicated table" crashes (same pattern as test 02863). Closes ClickHouse#81738 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
tests/config/config.d/fail_points_active.xmlwith 5REGULARfailpoints to activate during the stress test phaseinstall.sh: conditionally links the config whenCLICKHOUSE_FAILPOINTS_INJECTION=1stress_runner.sh: exportsCLICKHOUSE_FAILPOINTS_INJECTION=1before the stress phase; removes the config before the clean-restart checkOnly
REGULARfailpoints are included — they fire-and-return on every trigger with no synchronization, making them safe for long-running stress runs.ONCEfailpoints would fire once then auto-disable;PAUSEABLEfailpoints would deadlock the server since nothing callsSYSTEM NOTIFY FAILPOINTto resume them.Failpoints enabled:
replicated_merge_tree_commit_zk_fail_when_recovering_from_hw_faultuse_delayed_remote_sourcecluster_discovery_faultscheck_table_query_delay_for_partremove_merge_tree_part_delayTest plan
CLICKHOUSE_FAILPOINTS_INJECTION=1🤖 Generated with Claude Code
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
...
Documentation entry for user-facing changes