Skip to content

Fix parallel replicas crash with lazy remote source#101154

Open
groeneai wants to merge 2 commits intoClickHouse:masterfrom
groeneai:fix/parallel-replicas-lazy-shard-num-crash
Open

Fix parallel replicas crash with lazy remote source#101154
groeneai wants to merge 2 commits intoClickHouse:masterfrom
groeneai:fix/parallel-replicas-lazy-shard-num-crash

Conversation

@groeneai
Copy link
Copy Markdown
Contributor

@groeneai groeneai commented Mar 30, 2026

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

N/A

What is the problem?

STID 5066: ReadFromRemote::addLazyPipe() was missing the cluster_for_parallel_replicas setting override that addPipe() has. When the lazy remote source path is used (forced by the use_delayed_remote_source failpoint enabled in stress tests by #100682), the _shard_num scalar from the distributed execution context leaks into prepareClusterForParallelReplicas() with a value that exceeds the parallel replicas cluster's shard count.

For example, shard 2 of a 2-shard distributed cluster sets _shard_num=2, but cluster_for_parallel_replicas (typically 1-shard with N replicas) only has 1 shard — causing a LOGICAL_ERROR crash.

This regression started on March 29, 2026 when #100682 enabled the use_delayed_remote_source failpoint in stress tests. 30+ crashes across 22 unrelated PRs + 4 master hits in 2 days.

Related issue: #81738

How was it fixed?

Added the same cluster_for_parallel_replicas override to ReadFromRemote::addLazyPipe() that already exists in ReadFromRemote::addPipe(), ensuring the parallel replicas cluster matches the distributed table's cluster in both code paths.

The regression test uses no-parallel tag because the use_delayed_remote_source failpoint is global — concurrent tests could trigger "Unexpected lazy remote read from a non-replicated table" crashes (same pattern as test 02863).

Note: Changed from "Bug Fix" to "CI Fix" because the bug only triggers via the use_delayed_remote_source failpoint which is debug-only (disabled in release builds).

How was the fix verified?

  1. Reproduction: Enabling failpoint + distributed table over 2-shard cluster + parallel replicas → crashes before fix, passes after
  2. Regression test: 10/10 passes locally with randomization
  3. CI on previous iteration: 198,911 test results with zero failures on commit 919278f

@groeneai
Copy link
Copy Markdown
Contributor Author

Pre-PR Validation Gate

Session: cron:clickhouse-ci-task-worker:20260330-021500

a) Deterministic repro?
Yes. The following deterministically crashes the server before the fix:

SYSTEM ENABLE FAILPOINT use_delayed_remote_source;
SET allow_experimental_parallel_reading_from_replicas = 1, max_parallel_replicas = 3;
SET cluster_for_parallel_replicas = 'parallel_replicas';
SET parallel_replicas_for_non_replicated_merge_tree = 1;
SELECT sum(a) FROM <distributed_table_over_2_shard_cluster>;

b) Root cause explained?
ReadFromRemote::addPipe() overrides cluster_for_parallel_replicas to match the distributed table's cluster (line 683). ReadFromRemote::addLazyPipe() was missing this override. When the use_delayed_remote_source failpoint forces the lazy path, _shard_num from the distributed execution (e.g., 2) is sent to the remote server with the original cluster_for_parallel_replicas (e.g., parallel_replicas with 1 shard). prepareClusterForParallelReplicas() then crashes because shard_num=2 > shard_count=1.

c) Fix matches root cause?
Yes — the fix adds the identical cluster_for_parallel_replicas override to addLazyPipe(), directly addressing the missing override.

d) New tests added?
Yes — 04069_parallel_replicas_distributed_lazy_shard_num.sql reproduces the exact crash scenario.

e) Demonstrated in both directions?

  • Without fix: Server crashes with Logical error: 'Shard number is greater than shard count: shard_num=2 shard_count=1 cluster=parallel_replicas'
  • With fix: 10/10 passes, correct result returned

@groeneai
Copy link
Copy Markdown
Contributor Author

cc @nickitat @zlareb1 — could you review this? This fixes a P0 LOGICAL_ERROR crash in parallel replicas that was triggered by the use_delayed_remote_source failpoint enabled in stress tests by #100682. The fix adds the same cluster_for_parallel_replicas override to addLazyPipe() that already exists in addPipe(). 14+ crashes across 13 unrelated PRs in the last 2 days.

@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Mar 30, 2026

Workflow [PR], commit [dd950b8]

Summary:

job_name test_name status info comment
Integration tests (amd_asan_ubsan, db disk, old analyzer, 2/6) failure
test_backup_restore_s3/test.py::test_backup_to_s3_different_credentials[data_file_name_from_checksum-native_multipart] FAIL cidb
Integration tests (amd_llvm_coverage, 5/5) failure
test_overcommit_tracker/test.py::test_user_overcommit FAIL cidb
Stress test (arm_msan) failure
Server died FAIL cidb
Cannot start clickhouse-server FAIL cidb, issue
Parse failure error failure cidb, issue
Check failed failure cidb, issue

AI Review

Summary

This PR fixes a real asymmetry between ReadFromRemote::addPipe and ReadFromRemote::addLazyPipe: the lazy path now also overrides cluster_for_parallel_replicas, preventing mismatched _shard_num values from triggering LOGICAL_ERROR under the use_delayed_remote_source failpoint. The change is small, targeted, and backed by a focused regression test that reproduces the scenario safely. High-confidence review result: no correctness, safety, or performance issues found in the submitted diff.

ClickHouse Rules
Item Status Notes
Deletion logging
Serialization versioning
Core-area scrutiny
No test removal
Experimental gate
No magic constants
Backward compatibility
SettingsChangesHistory.cpp
PR metadata quality
Safe rollout
Compilation time
Final Verdict
  • Status: ✅ Approve

@clickhouse-gh clickhouse-gh bot added the pr-bugfix Pull request with bugfix, not backported by default label Mar 30, 2026
@zlareb1
Copy link
Copy Markdown
Member

zlareb1 commented Mar 30, 2026

@groeneai can you give list of PRs?

@groeneai
Copy link
Copy Markdown
Contributor Author

@zlareb1 Here's the full list from CIDB. 30 crash instances across 22 distinct PRs + master in the last 30 days (all since PR #100682 enabled the use_delayed_remote_source failpoint on March 29):

Master (4 hits):

Time (UTC) Check Commit
2026-03-30 04:44 Stress test (arm_ubsan) d51252c
2026-03-30 04:36 Stress test (amd_debug) d51252c
2026-03-30 02:13 Stress test (arm_asan_ubsan) 696456d
2026-03-30 02:13 Stress test (arm_asan_ubsan, s3) 696456d

PRs (26 hits across 22 PRs):

PR Check(s) Latest hit
#101134 amd_msan Mar 29 23:52
#101126 amd_debug Mar 30 05:00
#101111 arm_tsan, amd_msan Mar 30 03:25
#101110 amd_tsan Mar 29 21:46
#101073 amd_debug Mar 29 21:31
#101069 amd_debug Mar 30 00:54
#101066 amd_msan Mar 29 18:26
#101051 amd_msan Mar 29 22:14
#101048 amd_debug Mar 30 04:20
#101038 arm_asan_ubsan (s3), amd_msan Mar 29 18:06
#101027 amd_msan Mar 30 06:31
#101023 amd_tsan Mar 29 19:58
#101006 amd_msan Mar 30 00:49
#100975 arm_ubsan Mar 30 03:13
#100972 arm_tsan, amd_msan Mar 30 04:42
#100941 arm_release, arm_debug Mar 30 05:24
#100606 amd_debug Mar 30 03:45
#100604 amd_debug Mar 30 05:33
#100376 amd_tsan Mar 30 05:20
#99721 amd_msan Mar 30 03:10
#99513 amd_debug Mar 29 22:26
#96802 amd_msan Mar 29 19:01

All PRs are completely unrelated to each other and to parallel replicas — the crash is a pre-existing trunk bug exposed by the use_delayed_remote_source failpoint from #100682.

CIDB query used:

SELECT pull_request_number, check_name, check_start_time, commit_sha
FROM default.checks
WHERE test_name LIKE '%STID: 5066%'
  AND test_status IN ('FAIL', 'ERROR')
  AND check_start_time > now() - INTERVAL 30 DAY
ORDER BY check_start_time DESC

@nickitat nickitat added the can be tested Allows running workflows for external contributors label Mar 30, 2026
@devcrafter devcrafter self-assigned this Mar 30, 2026
@pufit
Copy link
Copy Markdown
Member

pufit commented Mar 30, 2026

@groeneai fast tests failed

@groeneai groeneai force-pushed the fix/parallel-replicas-lazy-shard-num-crash branch from 072fdec to eb41921 Compare March 30, 2026 16:21
@groeneai
Copy link
Copy Markdown
Contributor Author

@pufit Fixed, thanks for flagging. Two issues:

  1. Cluster: test_cluster_two_shard_three_replicas_localhost uses 127.0.0.2127.0.0.6 which aren't reachable in Fast test. Switched to test_cluster_two_shards_localhost (both shards on 127.0.0.1). This still exercises the bug — the crash is triggered by _shard_num = 2 from the distributed execution, regardless of which addresses the shards use.

  2. Output: The exact sum() result depends on whether Keeper is available for parallel replicas coordination. Without Keeper, parallel replicas falls back to normal distributed execution (2 shards × same local table = double sum). Changed to FORMAT Null + SELECT 'OK' since the test's purpose is verifying no crash, not checking exact values.

Verified 20/20 locally.

@groeneai groeneai force-pushed the fix/parallel-replicas-lazy-shard-num-crash branch from eb41921 to 919278f Compare March 30, 2026 23:34
@clickhouse-gh clickhouse-gh bot added the pr-ci label Mar 30, 2026
@groeneai
Copy link
Copy Markdown
Contributor Author

@pufit @nickitat @zlareb1 — Pushed a fix for the STID 5798 regression ("Unexpected lazy remote read from a non-replicated table" server crashes across all flaky checks).

Root cause: The use_delayed_remote_source failpoint is global — when test 04069 enables it, ALL concurrent distributed queries force lazy=true in SelectStreamFactory (including queries on non-replicated MergeTree tables). If the failpoint is disabled before the lazy lambda executes, the lambda hits the !use_delayed_remote_source branch and crashes because the storage isn't StorageReplicatedMergeTree.

This is the same race condition that test 02863_delayed_source_with_totals_and_extremes already handles with -- Tags: no-parallel.

Fixes (commit 919278f):

  1. Added no-parallel tag to test 04069 (prevents failpoint from affecting concurrent tests)
  2. Replaced the LOGICAL_ERROR assertion with a graceful fallback to local plan for non-replicated tables in the lazy path — since delay-based replica selection doesn't apply to non-replicated tables, createLocalPlan() is the correct behavior (matches what SelectStreamFactory::createForShardImpl() does at line 247)

Also changed changelog category to "CI Fix or Improvement" since the bug only triggers via the debug-only failpoint (bugfix validation can't reproduce against release binary).

Session: cron:clickhouse-ci-task-worker:20260330-231500

@clickhouse-gh clickhouse-gh bot removed the pr-bugfix Pull request with bugfix, not backported by default label Mar 30, 2026
@devcrafter
Copy link
Copy Markdown
Member

devcrafter commented Mar 31, 2026

@groeneai
We shouldn't do this :

Replaced the LOGICAL_ERROR assertion with a graceful fallback to local plan for non-replicated tables in the lazy path — since delay-based replica selection doesn't apply to non-replicated tables, createLocalPlan() is the correct behavior (matches what SelectStreamFactory::createForShardImpl() does at line 247)

Adding no-parallel tag to test should be enough

ReadFromRemote::addLazyPipe() was missing the cluster_for_parallel_replicas
override that addPipe() has. When the use_delayed_remote_source failpoint
forces the lazy path, the _shard_num scalar from the distributed execution
leaks into prepareClusterForParallelReplicas with a value exceeding the
parallel replicas cluster's shard count, causing a LOGICAL_ERROR crash.

This was triggered by PR ClickHouse#100682 enabling the use_delayed_remote_source
failpoint in stress tests, causing 30+ crashes across unrelated PRs.

The fix adds the same cluster_for_parallel_replicas override to addLazyPipe()
that already exists in addPipe(), ensuring the parallel replicas cluster
matches the distributed table's cluster in both code paths.

The regression test uses no-parallel tag because the use_delayed_remote_source
failpoint is global — concurrent tests could trigger "Unexpected lazy remote
read from a non-replicated table" crashes (same pattern as test 02863).

Closes ClickHouse#81738

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@groeneai groeneai force-pushed the fix/parallel-replicas-lazy-shard-num-crash branch from 919278f to 975fdb9 Compare March 31, 2026 12:20
@groeneai
Copy link
Copy Markdown
Contributor Author

@devcrafter You're right — reverted the graceful fallback. The no-parallel tag is sufficient to prevent the race.

Pushed a clean single commit (975fdb9) that now contains only:

  1. The cluster_for_parallel_replicas override in addLazyPipe() (the actual STID 5066 fix)
  2. Regression test with no-parallel tag

The createLocalPlan fallback for non-replicated tables is removed — the original LOGICAL_ERROR assertion stays.

@alexey-milovidov
Copy link
Copy Markdown
Member

@groeneai Investigate this flaky test failure: 02377_extend_protocol_with_query_parameters and fix it as well:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=101154&sha=975fdb909a96ac079fbc750c2a4b078e3dda6a9c&name_0=PR&name_1=Stateless%20tests%20%28amd_tsan%2C%20parallel%2C%202%2F2%29

Also, send a separate PR with that fix.

Copy link
Copy Markdown
Member

@devcrafter devcrafter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now

@groeneai
Copy link
Copy Markdown
Contributor Author

@alexey-milovidov Investigated and fixed in a separate PR: #101437

Root cause: When CI randomizes map_serialization_version to with_buckets, Map keys are distributed into hash buckets which changes iteration order on readback. The test inserts into a MergeTree table with Map columns and expects a specific key order — but bucketed serialization doesn't preserve insertion order.

CIDB evidence: 10 failures across 7 unrelated PRs in 30 days (amd_tsan, amd_asan_ubsan, amd_msan, amd_debug, arm_binary, amd_llvm_coverage).

Fix: Pin map_serialization_version='basic' in the test's CREATE TABLE SETTINGS. Targeted — the test validates query parameter passing, not Map serialization.

@antaljanosbenjamin antaljanosbenjamin mentioned this pull request Apr 1, 2026
1 task
@devcrafter devcrafter enabled auto-merge April 1, 2026 14:03
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Apr 1, 2026

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.10% 84.00% -0.10%
Functions 90.90% 90.90% +0.00%
Branches 76.60% 76.50% -0.10%

Changed lines: 25.00% (5/20) · Uncovered code

Full report · Diff report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-ci

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants