Skip to content

CI: enable failpoint injection in stress tests#100682

Merged
alexey-milovidov merged 3 commits intoClickHouse:masterfrom
zlareb1:failpoints_in_stress_tests
Mar 29, 2026
Merged

CI: enable failpoint injection in stress tests#100682
alexey-milovidov merged 3 commits intoClickHouse:masterfrom
zlareb1:failpoints_in_stress_tests

Conversation

@zlareb1
Copy link
Copy Markdown
Member

@zlareb1 zlareb1 commented Mar 25, 2026

Summary

  • Add tests/config/config.d/fail_points_active.xml with 5 REGULAR failpoints to activate during the stress test phase
  • install.sh: conditionally links the config when CLICKHOUSE_FAILPOINTS_INJECTION=1
  • stress_runner.sh: exports CLICKHOUSE_FAILPOINTS_INJECTION=1 before the stress phase; removes the config before the clean-restart check

Only REGULAR failpoints are included — they fire-and-return on every trigger with no synchronization, making them safe for long-running stress runs. ONCE failpoints would fire once then auto-disable; PAUSEABLE failpoints would deadlock the server since nothing calls SYSTEM NOTIFY FAILPOINT to resume them.

Failpoints enabled:

Failpoint Purpose
replicated_merge_tree_commit_zk_fail_when_recovering_from_hw_fault Exercises ZK commit failure recovery path
use_delayed_remote_source Adds latency to remote source reads
cluster_discovery_faults Injects faults in cluster discovery
check_table_query_delay_for_part Delays part-level table checks
remove_merge_tree_part_delay Delays MergeTree part removal

Test plan

  • Stress test run passes with CLICKHOUSE_FAILPOINTS_INJECTION=1
  • Server starts cleanly after stress phase (config removed before clean-restart check)
  • No regressions in existing stress test checks

🤖 Generated with Claude Code

Changelog category (leave one):

  • CI Fix or Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Activate a set of REGULAR failpoints during the stress test phase to
increase code path coverage. Only REGULAR failpoints are used — they
fire-and-return on every trigger with no synchronization, making them
safe for long-running stress runs (ONCE would fire once then vanish;
PAUSEABLE would deadlock with nothing to resume them).

Changes:
- Add tests/config/config.d/fail_points_active.xml with 5 REGULAR
  failpoints: replicated_merge_tree_commit_zk_fail_when_recovering_from_hw_fault,
  use_delayed_remote_source, cluster_discovery_faults,
  check_table_query_delay_for_part, remove_merge_tree_part_delay
- install.sh: link the config when CLICKHOUSE_FAILPOINTS_INJECTION=1
- stress_runner.sh: export CLICKHOUSE_FAILPOINTS_INJECTION=1 for the
  stress phase; remove the config before the clean-restart check

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Mar 25, 2026

Workflow [PR], commit [5a593c9]

Summary:


AI Review

Summary

This PR enables controlled failpoint injection during the stress phase by adding fail_points_active.xml, wiring it via install.sh behind CLICKHOUSE_FAILPOINTS_INJECTION=1, and removing the config before the clean-restart validation. I did not find correctness, safety, performance, or rollout issues in the changed code paths, and the PR metadata is consistent with a CI-only change.

ClickHouse Rules
Item Status Notes
Deletion logging
Serialization versioning
Core-area scrutiny
No test removal
Experimental gate
No magic constants
Backward compatibility
SettingsChangesHistory.cpp
PR metadata quality
Safe rollout
Compilation time
Final Verdict
  • Status: ✅ Approve

@zlareb1 zlareb1 marked this pull request as draft March 25, 2026 08:45
@clickhouse-gh clickhouse-gh bot added the pr-ci label Mar 25, 2026
This failpoint sleeps 1300-1500ms on every MergeTree part removal.
During DETACH DATABASE with database_atomic_wait_for_drop_and_detach_synchronously=1,
the server must remove all parts before completing. With many parts
accumulated during stress, this caused the hung check to trigger (391s).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@zlareb1 zlareb1 marked this pull request as ready for review March 25, 2026 19:59
Copy link
Copy Markdown
Member

@alexey-milovidov alexey-milovidov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM!

@alexey-milovidov alexey-milovidov self-assigned this Mar 29, 2026
@alexey-milovidov alexey-milovidov merged commit d3db2b6 into ClickHouse:master Mar 29, 2026
152 of 153 checks passed
@robot-ch-test-poll robot-ch-test-poll added the pr-synced-to-cloud The PR is synced to the cloud repo label Mar 29, 2026
groeneai added a commit to groeneai/ClickHouse that referenced this pull request Mar 30, 2026
ReadFromRemote::addLazyPipe() was missing the cluster_for_parallel_replicas
override that addPipe() has. When the use_delayed_remote_source failpoint
forces the lazy path, the _shard_num scalar from the distributed execution
leaks into prepareClusterForParallelReplicas with a value exceeding the
parallel replicas cluster's shard count, causing a LOGICAL_ERROR crash.

This was triggered by PR ClickHouse#100682 enabling the use_delayed_remote_source
failpoint in stress tests, causing 14+ crashes across unrelated PRs.

The fix adds the same cluster_for_parallel_replicas override to addLazyPipe()
that already exists in addPipe(), ensuring the parallel replicas cluster
matches the distributed table's cluster in both code paths.

Closes ClickHouse#81738

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
groeneai added a commit to groeneai/ClickHouse that referenced this pull request Mar 30, 2026
ReadFromRemote::addLazyPipe() was missing the cluster_for_parallel_replicas
override that addPipe() has. When the use_delayed_remote_source failpoint
forces the lazy path, the _shard_num scalar from the distributed execution
leaks into prepareClusterForParallelReplicas with a value exceeding the
parallel replicas cluster's shard count, causing a LOGICAL_ERROR crash.

This was triggered by PR ClickHouse#100682 enabling the use_delayed_remote_source
failpoint in stress tests, causing 14+ crashes across unrelated PRs.

The fix adds the same cluster_for_parallel_replicas override to addLazyPipe()
that already exists in addPipe(), ensuring the parallel replicas cluster
matches the distributed table's cluster in both code paths.

Closes ClickHouse#81738

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Desel72 pushed a commit to Desel72/ClickHouse that referenced this pull request Mar 30, 2026
…s_tests

CI: enable failpoint injection in stress tests
groeneai added a commit to groeneai/ClickHouse that referenced this pull request Mar 30, 2026
ReadFromRemote::addLazyPipe() was missing the cluster_for_parallel_replicas
override that addPipe() has. When the use_delayed_remote_source failpoint
forces the lazy path, the _shard_num scalar from the distributed execution
leaks into prepareClusterForParallelReplicas with a value exceeding the
parallel replicas cluster's shard count, causing a LOGICAL_ERROR crash.

This was triggered by PR ClickHouse#100682 enabling the use_delayed_remote_source
failpoint in stress tests, causing 14+ crashes across unrelated PRs.

The fix adds the same cluster_for_parallel_replicas override to addLazyPipe()
that already exists in addPipe(), ensuring the parallel replicas cluster
matches the distributed table's cluster in both code paths.

Closes ClickHouse#81738

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
groeneai added a commit to groeneai/ClickHouse that referenced this pull request Mar 31, 2026
ReadFromRemote::addLazyPipe() was missing the cluster_for_parallel_replicas
override that addPipe() has. When the use_delayed_remote_source failpoint
forces the lazy path, the _shard_num scalar from the distributed execution
leaks into prepareClusterForParallelReplicas with a value exceeding the
parallel replicas cluster's shard count, causing a LOGICAL_ERROR crash.

This was triggered by PR ClickHouse#100682 enabling the use_delayed_remote_source
failpoint in stress tests, causing 30+ crashes across unrelated PRs.

The fix adds the same cluster_for_parallel_replicas override to addLazyPipe()
that already exists in addPipe(), ensuring the parallel replicas cluster
matches the distributed table's cluster in both code paths.

The regression test uses no-parallel tag because the use_delayed_remote_source
failpoint is global — concurrent tests could trigger "Unexpected lazy remote
read from a non-replicated table" crashes (same pattern as test 02863).

Closes ClickHouse#81738

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-ci pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants