Skip to content

Conversation

@sollhui
Copy link
Contributor

@sollhui sollhui commented Jul 16, 2025

What problem does this PR solve?

Background

When using bulk load, it was found that one piece of data was lost(expect "2060625" but real is "2060624"):
image

Root Cause
image Multiple concurrent split file locations will be determined in plan phase, if the split point happens to be in the middle of the multi char line delimiter:
  • The previous concurrent will read the complete row1 and read a little more to read the line delimiter.
  • The latter concurrency will start reading from half of the multi char line delimiter, and row2 is the first line of this concurrency, but the first line in the middle range is always discarded, so row2 will be lost.
Solution

Start reading by adding the length of the line delimiter character forward

_start_offset -= line_delimiter.size();
image

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Copy link
Contributor

Thearas commented Jul 16, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@sollhui
Copy link
Contributor Author

sollhui commented Jul 16, 2025

run buildall

@sollhui sollhui force-pushed the fix_csv_reader_loss_data branch from c3d2310 to 6390b03 Compare July 16, 2025 07:33
@sollhui
Copy link
Contributor Author

sollhui commented Jul 16, 2025

run buildall

@sollhui sollhui force-pushed the fix_csv_reader_loss_data branch 2 times, most recently from bdd269f to 8689240 Compare July 16, 2025 08:13
@sollhui
Copy link
Contributor Author

sollhui commented Jul 16, 2025

run buildall

@sollhui sollhui force-pushed the fix_csv_reader_loss_data branch from 8689240 to e1c9fd0 Compare July 16, 2025 08:17
@sollhui
Copy link
Contributor Author

sollhui commented Jul 16, 2025

run buildall

Copy link
Contributor

@liaoxin01 liaoxin01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions
Copy link
Contributor

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Jul 16, 2025
@github-actions
Copy link
Contributor

PR approved by anyone and no changes requested.

@doris-robot
Copy link

TPC-H: Total hot run time: 33710 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit e1c9fd0e4bed6efe4326b8215287b7947e2e2186, data reload: false

------ Round 1 ----------------------------------
q1	17579	5197	5065	5065
q2	1930	264	177	177
q3	10319	1257	694	694
q4	10223	979	523	523
q5	7684	2255	2384	2255
q6	179	156	127	127
q7	875	742	608	608
q8	9323	1297	1067	1067
q9	6813	5064	5110	5064
q10	6883	2391	1966	1966
q11	502	303	274	274
q12	345	346	208	208
q13	17794	3679	3066	3066
q14	237	227	206	206
q15	549	476	479	476
q16	418	440	391	391
q17	616	859	355	355
q18	7383	7140	7210	7140
q19	1234	946	557	557
q20	330	336	219	219
q21	3800	2519	2310	2310
q22	1060	1065	962	962
Total cold run time: 106076 ms
Total hot run time: 33710 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5101	5077	5060	5060
q2	248	321	221	221
q3	2141	2687	2289	2289
q4	1303	1795	1341	1341
q5	4224	4314	4583	4314
q6	208	165	126	126
q7	1953	1966	1808	1808
q8	2616	2476	2504	2476
q9	7428	7315	7454	7315
q10	3121	3251	2909	2909
q11	659	615	502	502
q12	662	760	614	614
q13	3495	3999	3430	3430
q14	285	307	270	270
q15	514	482	471	471
q16	442	503	425	425
q17	1183	1660	1400	1400
q18	7915	7678	7619	7619
q19	799	819	951	819
q20	1897	1943	1846	1846
q21	4708	4264	4251	4251
q22	1091	1062	1012	1012
Total cold run time: 51993 ms
Total hot run time: 50518 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 186366 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit e1c9fd0e4bed6efe4326b8215287b7947e2e2186, data reload: false

query1	990	386	396	386
query2	6532	1723	1699	1699
query3	6773	219	219	219
query4	26525	23801	23169	23169
query5	4392	567	437	437
query6	305	219	195	195
query7	4616	501	281	281
query8	284	223	206	206
query9	8621	2632	2665	2632
query10	459	336	257	257
query11	15823	15018	14767	14767
query12	159	102	103	102
query13	1629	501	397	397
query14	9789	5742	5784	5742
query15	200	198	161	161
query16	7649	625	477	477
query17	1274	705	562	562
query18	2005	389	304	304
query19	199	187	165	165
query20	121	119	116	116
query21	215	131	116	116
query22	4272	4451	4117	4117
query23	33873	32952	33228	32952
query24	8299	2339	2342	2339
query25	562	456	388	388
query26	1218	269	146	146
query27	2761	501	336	336
query28	4325	2148	2110	2110
query29	733	554	467	467
query30	284	222	181	181
query31	944	829	767	767
query32	73	66	67	66
query33	548	359	321	321
query34	784	832	512	512
query35	798	826	742	742
query36	955	985	895	895
query37	114	97	73	73
query38	4114	4157	4105	4105
query39	1519	1446	1429	1429
query40	206	113	104	104
query41	58	53	52	52
query42	117	103	102	102
query43	529	488	472	472
query44	1292	822	810	810
query45	172	168	162	162
query46	832	1010	619	619
query47	1796	1851	1804	1804
query48	379	415	330	330
query49	736	457	387	387
query50	637	684	429	429
query51	5507	5500	5354	5354
query52	110	99	99	99
query53	215	253	180	180
query54	555	570	478	478
query55	83	81	82	81
query56	314	285	287	285
query57	1227	1191	1148	1148
query58	269	275	260	260
query59	2623	2727	2649	2649
query60	324	339	293	293
query61	127	128	125	125
query62	842	715	668	668
query63	216	189	186	186
query64	4242	975	642	642
query65	4224	4147	4158	4147
query66	1045	413	307	307
query67	15944	15788	15578	15578
query68	6013	879	533	533
query69	474	307	264	264
query70	1231	1141	1109	1109
query71	423	319	307	307
query72	5144	4656	4618	4618
query73	633	561	339	339
query74	8899	9112	8874	8874
query75	3162	3200	2680	2680
query76	3245	1144	726	726
query77	483	360	285	285
query78	10057	10141	9241	9241
query79	1426	816	562	562
query80	1341	527	470	470
query81	516	262	218	218
query82	394	125	96	96
query83	261	276	244	244
query84	243	103	99	99
query85	824	441	474	441
query86	344	314	289	289
query87	4419	4519	4318	4318
query88	2900	2294	2303	2294
query89	372	318	289	289
query90	1724	206	200	200
query91	138	141	109	109
query92	62	58	57	57
query93	1107	924	594	594
query94	667	406	304	304
query95	364	293	357	293
query96	494	555	283	283
query97	2688	2767	2670	2670
query98	219	217	217	217
query99	1294	1434	1256	1256
Total cold run time: 270918 ms
Total hot run time: 186366 ms

@dataroaring dataroaring added the usercase Important user case type label label Jul 16, 2025
@doris-robot
Copy link

ClickBench: Total hot run time: 32.63 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit e1c9fd0e4bed6efe4326b8215287b7947e2e2186, data reload: false

query1	0.05	0.04	0.04
query2	0.08	0.04	0.04
query3	0.24	0.08	0.07
query4	1.62	0.11	0.11
query5	0.43	0.44	0.43
query6	1.17	0.66	0.65
query7	0.03	0.01	0.02
query8	0.04	0.04	0.03
query9	0.61	0.52	0.52
query10	0.58	0.57	0.58
query11	0.16	0.11	0.11
query12	0.15	0.12	0.11
query13	0.63	0.62	0.61
query14	0.78	0.80	0.82
query15	0.89	0.88	0.86
query16	0.38	0.38	0.40
query17	1.06	1.04	1.06
query18	0.24	0.22	0.22
query19	1.94	1.80	1.80
query20	0.01	0.00	0.01
query21	15.42	0.87	0.54
query22	0.76	1.20	0.77
query23	14.77	1.37	0.62
query24	6.85	1.74	0.94
query25	0.48	0.18	0.15
query26	0.62	0.17	0.13
query27	0.06	0.07	0.05
query28	9.30	0.87	0.44
query29	12.58	3.88	3.22
query30	3.07	3.00	2.98
query31	2.81	0.59	0.37
query32	3.23	0.54	0.47
query33	3.10	3.09	3.22
query34	16.02	5.43	4.80
query35	4.82	4.87	4.84
query36	0.69	0.51	0.48
query37	0.09	0.07	0.07
query38	0.05	0.04	0.04
query39	0.04	0.02	0.02
query40	0.18	0.14	0.14
query41	0.07	0.03	0.02
query42	0.03	0.02	0.03
query43	0.04	0.03	0.03
Total cold run time: 106.17 s
Total hot run time: 32.63 s

@doris-robot
Copy link

BE UT Coverage Report

Increment line coverage 0.00% (0/5) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 57.50% (15723/27344)
Line Coverage 46.13% (140205/303923)
Region Coverage 35.42% (104507/295057)
Branch Coverage 38.03% (46073/121141)

@hello-stephen
Copy link
Contributor

BE Regression && UT Coverage Report

Increment line coverage 0.00% (0/5) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 81.33% (21812/26820)
Line Coverage 73.91% (224194/303314)
Region Coverage 61.52% (186241/302715)
Branch Coverage 65.43% (80142/122485)

1 similar comment
@hello-stephen
Copy link
Contributor

BE Regression && UT Coverage Report

Increment line coverage 0.00% (0/5) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 81.33% (21812/26820)
Line Coverage 73.91% (224194/303314)
Region Coverage 61.52% (186241/302715)
Branch Coverage 65.43% (80142/122485)

Copy link
Contributor

@suxiaogang223 suxiaogang223 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@liaoxin01 liaoxin01 merged commit fd6b132 into apache:master Jul 17, 2025
28 of 32 checks passed
github-actions bot pushed a commit that referenced this pull request Jul 17, 2025
…r line delimiter (#53374)

Multiple concurrent split file locations will be determined in plan
phase, if the split point happens to be in the middle of the multi char
line delimiter:

- The previous concurrent will read the complete row1 and read a little
more to read the line delimiter.
- The latter concurrency will start reading from half of the multi char
line delimiter, and row2 is the first line of this concurrency, but the
first line in the middle range is always discarded, so row2 will be
lost.
morrySnow pushed a commit that referenced this pull request Jul 17, 2025
…ng multi char line delimiter #53374 (#53424)

Cherry-picked from #53374

Co-authored-by: hui lai <laihui@selectdb.com>
sollhui added a commit to sollhui/doris that referenced this pull request Jul 21, 2025
…r line delimiter (apache#53374)

Multiple concurrent split file locations will be determined in plan
phase, if the split point happens to be in the middle of the multi char
line delimiter:

- The previous concurrent will read the complete row1 and read a little
more to read the line delimiter.
- The latter concurrency will start reading from half of the multi char
line delimiter, and row2 is the first line of this concurrency, but the
first line in the middle range is always discarded, so row2 will be
lost.
sollhui added a commit to sollhui/doris that referenced this pull request Jul 21, 2025
…r line delimiter (apache#53374)

Multiple concurrent split file locations will be determined in plan
phase, if the split point happens to be in the middle of the multi char
line delimiter:

- The previous concurrent will read the complete row1 and read a little
more to read the line delimiter.
- The latter concurrency will start reading from half of the multi char
line delimiter, and row2 is the first line of this concurrency, but the
first line in the middle range is always discarded, so row2 will be
lost.
dataroaring pushed a commit that referenced this pull request Jul 21, 2025
…ng multi char line delimiter (#53374) (#53634)

pick (#53374)

Multiple concurrent split file locations will be determined in plan
phase, if the split point happens to be in the middle of the multi char
line delimiter:

- The previous concurrent will read the complete row1 and read a little
more to read the line delimiter.
- The latter concurrency will start reading from half of the multi char
line delimiter, and row2 is the first line of this concurrency, but the
first line in the middle range is always discarded, so row2 will be
lost.

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

### Release note

None

### Check List (For Author)

- Test <!-- At least one of them must be included. -->
    - [ ] Regression test
    - [ ] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test. Explain why:
- [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason <!-- Add your reason?  -->

- Behavior changed:
    - [ ] No.
    - [ ] Yes. <!-- Explain the behavior change -->

- Does this need documentation?
    - [ ] No.
- [ ] Yes. <!-- Add document PR link here. eg:
apache/doris-website#1214 -->

### Check List (For Reviewer who merge this PR)

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR
should merge into -->
yiguolei pushed a commit to sollhui/doris that referenced this pull request Jul 26, 2025
…r line delimiter (apache#53374)

Multiple concurrent split file locations will be determined in plan
phase, if the split point happens to be in the middle of the multi char
line delimiter:

- The previous concurrent will read the complete row1 and read a little
more to read the line delimiter.
- The latter concurrency will start reading from half of the multi char
line delimiter, and row2 is the first line of this concurrency, but the
first line in the middle range is always discarded, so row2 will be
lost.
yiguolei pushed a commit that referenced this pull request Jul 26, 2025
…ng multi char line delimiter (#53374) (#53635)

pick (#53374)

Multiple concurrent split file locations will be determined in plan
phase, if the split point happens to be in the middle of the multi char
line delimiter:

- The previous concurrent will read the complete row1 and read a little
more to read the line delimiter.
- The latter concurrency will start reading from half of the multi char
line delimiter, and row2 is the first line of this concurrency, but the
first line in the middle range is always discarded, so row2 will be
lost.

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

### Release note

None

### Check List (For Author)

- Test <!-- At least one of them must be included. -->
    - [ ] Regression test
    - [ ] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test. Explain why:
- [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason <!-- Add your reason?  -->

- Behavior changed:
    - [ ] No.
    - [ ] Yes. <!-- Explain the behavior change -->

- Does this need documentation?
    - [ ] No.
- [ ] Yes. <!-- Add document PR link here. eg:
apache/doris-website#1214 -->

### Check List (For Reviewer who merge this PR)

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR
should merge into -->
Hastyshell pushed a commit to Hastyshell/doris that referenced this pull request Jul 30, 2025
…r line delimiter (apache#53374) (apache#4199)

pick [(apache#53374)](apache#53374)

Multiple concurrent split file locations will be determined in plan
phase, if the split point happens to be in the middle of the multi char
line delimiter:

- The previous concurrent will read the complete row1 and read a little
more to read the line delimiter.
- The latter concurrency will start reading from half of the multi char
line delimiter, and row2 is the first line of this concurrency, but the
first line in the middle range is always discarded, so row2 will be
lost.

## Proposed changes

Issue Number: close #xxx

<!--Describe your changes.-->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. dev/2.1.11-merged dev/3.0.7-merged dev/3.1.0-merged reviewed usercase Important user case type label

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants