@Hastyshell
Collaborator

@Hastyshell Hastyshell commented Jul 9, 2025

What problem does this PR solve?

Related PR: #52440

Problem Summary:

In read-write splitting scenarios, some BE (Backend) nodes may have already merged certain rowset versions, while another BE still attempts to capture or access those rowsets.
When this happens, the BE reports error E-230 (versions already merged), causing data access or synchronization to fail.

This PR introduces a remote rowset fetching mechanism, allowing a BE that lacks the required rowset to fetch it from other BE nodes, instead of failing with E-230.

  • Added a remote fetch mechanism in the rowset management layer:
    When a BE detects that a rowset is missing locally but has already been merged, it will try to fetch the rowset from other BE nodes.
  • Updated version and state checking logic to correctly identify the “merged but missing” condition.
  • Adjusted the rowset access path to trigger remote fetch rather than throwing an immediate error.
  • Added tests (unit/integration) to cover the new logic where applicable.
  • Ensured backward compatibility: If the BE already has the rowset locally or read-write splitting is not enabled, the behavior remains unchanged.
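The fallback described in the list above can be sketched as a toy Python model. This is only an illustration of the control flow, not the BE implementation: `Backend`, `capture_rowset`, `VersionAlreadyMerged`, `merged_up_to` and `remote_fetch_enabled` are hypothetical names chosen for the example, and the rowset payloads are plain strings.

```python
from dataclasses import dataclass, field


class VersionAlreadyMerged(Exception):
    """Stand-in for the E-230 condition (hypothetical, not the real error type)."""


@dataclass
class Backend:
    name: str
    # Maps a (start_version, end_version) range to an opaque rowset payload.
    rowsets: dict = field(default_factory=dict)
    # Highest version this BE has already merged past (assumed bookkeeping).
    merged_up_to: int = 0

    def capture_local(self, version_range):
        """Return the local rowset, or raise if it is missing."""
        if version_range in self.rowsets:
            return self.rowsets[version_range]
        _start, end = version_range
        if end <= self.merged_up_to:
            # The "merged but missing" condition from the description.
            raise VersionAlreadyMerged(f"{self.name}: {version_range} already merged")
        raise KeyError(f"{self.name}: {version_range} not found")


def capture_rowset(local, peers, version_range, remote_fetch_enabled=True):
    """Try the local BE first; on "already merged", ask the other BEs."""
    try:
        return local.capture_local(version_range)
    except VersionAlreadyMerged:
        if not remote_fetch_enabled:
            raise  # previous behavior: fail immediately
        for peer in peers:
            payload = peer.rowsets.get(version_range)
            if payload is not None:
                return payload
        raise  # no peer still holds the rowset


if __name__ == "__main__":
    peer = Backend("be2", rowsets={(2, 5): "rowset-2-5"})
    local = Backend("be1", merged_up_to=10)
    print(capture_rowset(local, [peer], (2, 5)))
```

In this model a BE that already holds the rowset never contacts its peers, and passing `remote_fetch_enabled=False` reproduces the old fail-fast path, which mirrors the backward-compatibility point in the last bullet.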

Release note

Introduce a remote rowset fetching mechanism to prevent E-230 (“versions already merged”) errors in read-write splitting scenarios.
This improves BE fault tolerance when some nodes have merged versions that others have not yet synchronized.

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

gavinchou pushed a commit that referenced this pull request Jul 9, 2025
… rowsets #52995 (#52715)

Related: #52995

Problem Summary:

Make stale rowsets accessible across BEs to avoid E-230 (versions
already merged) in read-write splitting scenarios.
@Hastyshell
Collaborator Author

run buildall

@doris-robot

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 83.26% (1636/1965)
Line Coverage 67.77% (28912/42660)
Region Coverage 68.07% (14254/20939)
Branch Coverage 58.35% (7587/13002)

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 🎉
Increment coverage report
Complete coverage report

@Hastyshell Hastyshell force-pushed the fix-E-230-master branch 2 times, most recently from 4dc6d94 to 840053e Compare October 15, 2025 11:19
@Hastyshell
Collaborator Author

run buildall

@Hastyshell
Collaborator Author

run buildall

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/27) 🎉
Increment coverage report
Complete coverage report

@Hastyshell
Collaborator Author

run buildall

@Hastyshell Hastyshell marked this pull request as ready for review October 17, 2025 05:57
@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 80.77% (1647/2039)
Line Coverage 67.08% (29069/43334)
Region Coverage 67.38% (14383/21346)
Branch Coverage 57.70% (7640/13240)

@Hastyshell
Collaborator Author

run buildall

@doris-robot

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 80.77% (1647/2039)
Line Coverage 67.03% (29048/43334)
Region Coverage 67.35% (14376/21346)
Branch Coverage 57.69% (7638/13240)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/27) 🎉
Increment coverage report
Complete coverage report

@Hastyshell
Collaborator Author

run beut

@Hastyshell
Collaborator Author

run buildall

@Hastyshell
Collaborator Author

run buildall

@Hastyshell
Collaborator Author

run vault_p0

@doris-robot

BE UT Coverage Report

Increment line coverage 24.17% (146/604) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.62% (17937/34087)
Line Coverage 37.86% (162732/429871)
Region Coverage 32.31% (124239/384517)
Branch Coverage 33.68% (54364/161415)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 34.96% (215/615) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 72.34% (24177/33420)
Line Coverage 59.12% (254006/429646)
Region Coverage 54.93% (213941/389484)
Branch Coverage 56.26% (91319/162320)

@Hastyshell
Collaborator Author

run buildall

@doris-robot

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 80.62% (1647/2043)
Line Coverage 66.98% (29053/43376)
Region Coverage 67.31% (14382/21368)
Branch Coverage 57.65% (7641/13254)

@doris-robot

TPC-H: Total hot run time: 36123 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit f22b220bd4ca8c691940a2d447f90c10c51e1889, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	17676	5346	5158	5158
q2	1998	316	187	187
q3	10268	1371	763	763
q4	10241	973	423	423
q5	7363	894	522	522
q6	181	186	150	150
q7	788	756	547	547
q8	9400	737	540	540
q9	25243	4915	4918	4915
q10	9207	4879	4477	4477
q11	563	381	351	351
q12	436	461	300	300
q13	17819	3798	3134	3134
q14	449	444	441	441
q15	628	592	592	592
q16	561	494	464	464
q17	691	993	447	447
q18	7553	7278	7304	7278
q19	921	1030	586	586
q20	638	554	436	436
q21	4746	4155	4087	4087
q22	418	366	325	325
Total cold run time: 127788 ms
Total hot run time: 36123 ms
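The bot output does not label its columns. The short check below assumes the first three numeric columns of Round 1 are the cold run followed by two hot runs, and that the reported hot total takes the faster of the two hot runs for each query. This reading is an inference from the numbers, not something the report states. Under that assumption the script reproduces both totals printed above.

```python
# Round 1 rows copied from the report: query, cold, hot run 1, hot run 2 (ms).
ROUND_1 = """\
q1 17676 5346 5158
q2 1998 316 187
q3 10268 1371 763
q4 10241 973 423
q5 7363 894 522
q6 181 186 150
q7 788 756 547
q8 9400 737 540
q9 25243 4915 4918
q10 9207 4879 4477
q11 563 381 351
q12 436 461 300
q13 17819 3798 3134
q14 449 444 441
q15 628 592 592
q16 561 494 464
q17 691 993 447
q18 7553 7278 7304
q19 921 1030 586
q20 638 554 436
q21 4746 4155 4087
q22 418 366 325
"""


def totals(table):
    """Return (cold_total, hot_total) in ms for a whitespace-separated table."""
    cold_total = 0
    hot_total = 0
    for line in table.strip().splitlines():
        _query, cold, hot1, hot2 = line.split()
        cold_total += int(cold)
        # Assumed rule: the better (smaller) of the two hot runs counts.
        hot_total += min(int(hot1), int(hot2))
    return cold_total, hot_total


if __name__ == "__main__":
    print(totals(ROUND_1))  # (127788, 36123)
```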

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	5249	5340	5305	5305
q2	375	465	368	368
q3	2248	2756	2352	2352
q4	1547	1872	1446	1446
q5	4391	4297	4298	4297
q6	231	203	139	139
q7	8187	7978	7865	7865
q8	3080	3008	2950	2950
q9	24493	24475	24581	24475
q10	4394	4752	4399	4399
q11	1352	1307	1272	1272
q12	694	785	620	620
q13	3355	3761	3144	3144
q14	418	411	395	395
q15	590	545	537	537
q16	619	626	595	595
q17	2289	2581	2419	2419
q18	7830	7413	7314	7314
q19	938	987	1093	987
q20	2049	2076	1934	1934
q21	11515	10790	10884	10790
q22	688	634	589	589
Total cold run time: 86532 ms
Total hot run time: 84192 ms

@doris-robot

ClickBench: Total hot run time: 29.84 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit f22b220bd4ca8c691940a2d447f90c10c51e1889, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.06	0.07	0.05
query2	0.11	0.05	0.05
query3	0.27	0.11	0.10
query4	1.60	0.14	0.15
query5	0.41	0.35	0.36
query6	1.14	0.83	0.86
query7	0.03	0.02	0.02
query8	0.07	0.05	0.04
query9	0.75	0.72	0.68
query10	0.65	0.70	0.69
query11	0.18	0.15	0.15
query12	0.19	0.16	0.18
query13	0.74	0.71	0.65
query14	0.88	0.90	0.90
query15	0.92	0.91	0.88
query16	0.44	0.42	0.44
query17	1.07	1.08	1.08
query18	0.28	0.27	0.26
query19	2.01	1.97	2.12
query20	0.02	0.01	0.01
query21	15.38	0.19	0.15
query22	5.08	0.07	0.06
query23	15.64	0.28	0.12
query24	3.32	0.60	1.69
query25	0.10	0.08	0.06
query26	0.18	0.17	0.16
query27	0.08	0.07	0.07
query28	5.53	1.29	1.04
query29	12.56	4.23	3.49
query30	0.28	0.20	0.19
query31	2.82	0.67	0.46
query32	3.24	0.63	0.53
query33	3.11	3.14	3.22
query34	15.86	5.21	4.66
query35	4.67	4.63	4.63
query36	0.71	0.58	0.58
query37	0.10	0.08	0.09
query38	0.07	0.06	0.05
query39	0.03	0.04	0.04
query40	0.21	0.20	0.19
query41	0.08	0.04	0.04
query42	0.07	0.03	0.04
query43	0.05	0.06	0.04
Total cold run time: 100.99 s
Total hot run time: 29.84 s

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 0.00% (0/27) 🎉
Increment coverage report
Complete coverage report

@doris-robot

BE UT Coverage Report

Increment line coverage 24.17% (146/604) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.62% (17939/34091)
Line Coverage 37.86% (162762/429934)
Region Coverage 32.28% (124144/384562)
Branch Coverage 33.68% (54368/161437)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 33.17% (204/615) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.35% (23883/33472)
Line Coverage 57.76% (248506/430245)
Region Coverage 52.91% (206249/389814)
Branch Coverage 54.58% (88655/162418)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/27) 🎉
Increment coverage report
Complete coverage report

@Hastyshell
Collaborator Author

run cloud_p0

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 33.17% (204/615) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.35% (23883/33472)
Line Coverage 57.76% (248506/430245)
Region Coverage 52.91% (206249/389814)
Branch Coverage 54.58% (88655/162418)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/27) 🎉
Increment coverage report
Complete coverage report

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Oct 23, 2025
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@gavinchou gavinchou merged commit 75d4560 into apache:master Oct 23, 2025
25 of 26 checks passed
Hastyshell added a commit to Hastyshell/doris that referenced this pull request Oct 23, 2025
… rowsets (apache#52995)

yiguolei pushed a commit that referenced this pull request Oct 24, 2025
… rowsets (#52995) (#57271)

Related PR: #52995


### Release note

None

### Check List (For Author)

- Test <!-- At least one of them must be included. -->
    - [x] Regression test
    - [ ] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test. Explain why:
        - [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason <!-- Add your reason?  -->

- Behavior changed:
    - [x] No.
    - [ ] Yes. <!-- Explain the behavior change -->

- Does this need documentation?
    - [ ] No.
    - [x] Yes. <!-- Add document PR link here. eg: apache/doris-website#1214 -->

### Check List (For Reviewer who merge this PR)

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR should merge into -->
dwdwqfwe pushed a commit to dwdwqfwe/doris that referenced this pull request Oct 24, 2025
… rowsets (apache#52995)
