Fix parallel replicas crash with lazy remote source#101154
Fix parallel replicas crash with lazy remote source#101154groeneai wants to merge 2 commits intoClickHouse:masterfrom
Conversation
Pre-PR Validation GateSession: cron:clickhouse-ci-task-worker:20260330-021500 a) Deterministic repro? SYSTEM ENABLE FAILPOINT use_delayed_remote_source;
SET allow_experimental_parallel_reading_from_replicas = 1, max_parallel_replicas = 3;
SET cluster_for_parallel_replicas = 'parallel_replicas';
SET parallel_replicas_for_non_replicated_merge_tree = 1;
SELECT sum(a) FROM <distributed_table_over_2_shard_cluster>;b) Root cause explained? c) Fix matches root cause? d) New tests added? e) Demonstrated in both directions?
|
|
cc @nickitat @zlareb1 — could you review this? This fixes a P0 LOGICAL_ERROR crash in parallel replicas that was triggered by the |
|
Workflow [PR], commit [dd950b8] Summary: ❌
AI ReviewSummaryThis PR fixes a real asymmetry between ClickHouse Rules
Final Verdict
|
|
@groeneai can you give list of PRs? |
|
@zlareb1 Here's the full list from CIDB. 30 crash instances across 22 distinct PRs + master in the last 30 days (all since PR #100682 enabled the Master (4 hits):
PRs (26 hits across 22 PRs):
All PRs are completely unrelated to each other and to parallel replicas — the crash is a pre-existing trunk bug exposed by the CIDB query used: SELECT pull_request_number, check_name, check_start_time, commit_sha
FROM default.checks
WHERE test_name LIKE '%STID: 5066%'
AND test_status IN ('FAIL', 'ERROR')
AND check_start_time > now() - INTERVAL 30 DAY
ORDER BY check_start_time DESC |
|
@groeneai fast tests failed |
072fdec to
eb41921
Compare
|
@pufit Fixed, thanks for flagging. Two issues:
Verified 20/20 locally. |
eb41921 to
919278f
Compare
|
@pufit @nickitat @zlareb1 — Pushed a fix for the STID 5798 regression ("Unexpected lazy remote read from a non-replicated table" server crashes across all flaky checks). Root cause: The This is the same race condition that test Fixes (commit 919278f):
Also changed changelog category to "CI Fix or Improvement" since the bug only triggers via the debug-only failpoint (bugfix validation can't reproduce against release binary). Session: cron:clickhouse-ci-task-worker:20260330-231500 |
|
@groeneai
Adding |
ReadFromRemote::addLazyPipe() was missing the cluster_for_parallel_replicas override that addPipe() has. When the use_delayed_remote_source failpoint forces the lazy path, the _shard_num scalar from the distributed execution leaks into prepareClusterForParallelReplicas with a value exceeding the parallel replicas cluster's shard count, causing a LOGICAL_ERROR crash. This was triggered by PR ClickHouse#100682 enabling the use_delayed_remote_source failpoint in stress tests, causing 30+ crashes across unrelated PRs. The fix adds the same cluster_for_parallel_replicas override to addLazyPipe() that already exists in addPipe(), ensuring the parallel replicas cluster matches the distributed table's cluster in both code paths. The regression test uses no-parallel tag because the use_delayed_remote_source failpoint is global — concurrent tests could trigger "Unexpected lazy remote read from a non-replicated table" crashes (same pattern as test 02863). Closes ClickHouse#81738 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
919278f to
975fdb9
Compare
|
@devcrafter You're right — reverted the graceful fallback. The Pushed a clean single commit (975fdb9) that now contains only:
The |
|
@groeneai Investigate this flaky test failure: Also, send a separate PR with that fix. |
|
@alexey-milovidov Investigated and fixed in a separate PR: #101437 Root cause: When CI randomizes CIDB evidence: 10 failures across 7 unrelated PRs in 30 days (amd_tsan, amd_asan_ubsan, amd_msan, amd_debug, arm_binary, amd_llvm_coverage). Fix: Pin |
LLVM Coverage Report
Changed lines: 25.00% (5/20) · Uncovered code |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
N/A
What is the problem?
STID 5066:
ReadFromRemote::addLazyPipe()was missing thecluster_for_parallel_replicassetting override thataddPipe()has. When the lazy remote source path is used (forced by theuse_delayed_remote_sourcefailpoint enabled in stress tests by #100682), the_shard_numscalar from the distributed execution context leaks intoprepareClusterForParallelReplicas()with a value that exceeds the parallel replicas cluster's shard count.For example, shard 2 of a 2-shard distributed cluster sets
_shard_num=2, butcluster_for_parallel_replicas(typically 1-shard with N replicas) only has 1 shard — causing a LOGICAL_ERROR crash.This regression started on March 29, 2026 when #100682 enabled the
use_delayed_remote_sourcefailpoint in stress tests. 30+ crashes across 22 unrelated PRs + 4 master hits in 2 days.Related issue: #81738
How was it fixed?
Added the same
cluster_for_parallel_replicasoverride toReadFromRemote::addLazyPipe()that already exists inReadFromRemote::addPipe(), ensuring the parallel replicas cluster matches the distributed table's cluster in both code paths.The regression test uses
no-paralleltag because theuse_delayed_remote_sourcefailpoint is global — concurrent tests could trigger "Unexpected lazy remote read from a non-replicated table" crashes (same pattern as test 02863).Note: Changed from "Bug Fix" to "CI Fix" because the bug only triggers via the
use_delayed_remote_sourcefailpoint which is debug-only (disabled in release builds).How was the fix verified?