[fix](csv reader) fix data loss when concurrency read using multi char line delimiter #53374

sollhui · 2025-07-16T07:30:13Z

What problem does this PR solve?

Background

When using bulk load, it was found that one piece of data was lost(expect "2060625" but real is "2060624")：

Root Cause

Multiple concurrent split file locations will be determined in plan phase, if the split point happens to be in the middle of the multi char line delimiter:

The previous concurrent will read the complete row1 and read a little more to read the line delimiter.
The latter concurrency will start reading from half of the multi char line delimiter, and row2 is the first line of this concurrency, but the first line in the middle range is always discarded, so row2 will be lost.

Solution

Start reading by adding the length of the line delimiter character forward

_start_offset -= line_delimiter.size();

Release note

None

Check List (For Author)

Test
- Regression test
- Unit Test
- Manual test (add detailed scripts or steps below)
- No need to test or manual test. Explain why:
  - This is a refactor/code format and no logic has been changed.
  - Previous test can cover this change.
  - No code files have been changed.
  - Other reason
Behavior changed:
- No.
- Yes.
Does this need documentation?
- No.
- Yes.

Check List (For Reviewer who merge this PR)

Confirm the release note
Confirm test cases
Confirm document
Add branch pick label

Thearas · 2025-07-16T07:30:20Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

sollhui · 2025-07-16T07:31:07Z

run buildall

sollhui · 2025-07-16T07:33:39Z

run buildall

be/src/vec/exec/format/csv/csv_reader.cpp

sollhui · 2025-07-16T08:13:57Z

run buildall

sollhui · 2025-07-16T08:17:38Z

run buildall

liaoxin01

LGTM

github-actions · 2025-07-16T08:47:29Z

PR approved by at least one committer and no changes requested.

github-actions · 2025-07-16T08:47:31Z

PR approved by anyone and no changes requested.

doris-robot · 2025-07-16T08:56:39Z

TPC-H: Total hot run time: 33710 ms

machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit e1c9fd0e4bed6efe4326b8215287b7947e2e2186, data reload: false

------ Round 1 ----------------------------------
q1	17579	5197	5065	5065
q2	1930	264	177	177
q3	10319	1257	694	694
q4	10223	979	523	523
q5	7684	2255	2384	2255
q6	179	156	127	127
q7	875	742	608	608
q8	9323	1297	1067	1067
q9	6813	5064	5110	5064
q10	6883	2391	1966	1966
q11	502	303	274	274
q12	345	346	208	208
q13	17794	3679	3066	3066
q14	237	227	206	206
q15	549	476	479	476
q16	418	440	391	391
q17	616	859	355	355
q18	7383	7140	7210	7140
q19	1234	946	557	557
q20	330	336	219	219
q21	3800	2519	2310	2310
q22	1060	1065	962	962
Total cold run time: 106076 ms
Total hot run time: 33710 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5101	5077	5060	5060
q2	248	321	221	221
q3	2141	2687	2289	2289
q4	1303	1795	1341	1341
q5	4224	4314	4583	4314
q6	208	165	126	126
q7	1953	1966	1808	1808
q8	2616	2476	2504	2476
q9	7428	7315	7454	7315
q10	3121	3251	2909	2909
q11	659	615	502	502
q12	662	760	614	614
q13	3495	3999	3430	3430
q14	285	307	270	270
q15	514	482	471	471
q16	442	503	425	425
q17	1183	1660	1400	1400
q18	7915	7678	7619	7619
q19	799	819	951	819
q20	1897	1943	1846	1846
q21	4708	4264	4251	4251
q22	1091	1062	1012	1012
Total cold run time: 51993 ms
Total hot run time: 50518 ms

doris-robot · 2025-07-16T09:08:13Z

TPC-DS: Total hot run time: 186366 ms

machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit e1c9fd0e4bed6efe4326b8215287b7947e2e2186, data reload: false

query1	990	386	396	386
query2	6532	1723	1699	1699
query3	6773	219	219	219
query4	26525	23801	23169	23169
query5	4392	567	437	437
query6	305	219	195	195
query7	4616	501	281	281
query8	284	223	206	206
query9	8621	2632	2665	2632
query10	459	336	257	257
query11	15823	15018	14767	14767
query12	159	102	103	102
query13	1629	501	397	397
query14	9789	5742	5784	5742
query15	200	198	161	161
query16	7649	625	477	477
query17	1274	705	562	562
query18	2005	389	304	304
query19	199	187	165	165
query20	121	119	116	116
query21	215	131	116	116
query22	4272	4451	4117	4117
query23	33873	32952	33228	32952
query24	8299	2339	2342	2339
query25	562	456	388	388
query26	1218	269	146	146
query27	2761	501	336	336
query28	4325	2148	2110	2110
query29	733	554	467	467
query30	284	222	181	181
query31	944	829	767	767
query32	73	66	67	66
query33	548	359	321	321
query34	784	832	512	512
query35	798	826	742	742
query36	955	985	895	895
query37	114	97	73	73
query38	4114	4157	4105	4105
query39	1519	1446	1429	1429
query40	206	113	104	104
query41	58	53	52	52
query42	117	103	102	102
query43	529	488	472	472
query44	1292	822	810	810
query45	172	168	162	162
query46	832	1010	619	619
query47	1796	1851	1804	1804
query48	379	415	330	330
query49	736	457	387	387
query50	637	684	429	429
query51	5507	5500	5354	5354
query52	110	99	99	99
query53	215	253	180	180
query54	555	570	478	478
query55	83	81	82	81
query56	314	285	287	285
query57	1227	1191	1148	1148
query58	269	275	260	260
query59	2623	2727	2649	2649
query60	324	339	293	293
query61	127	128	125	125
query62	842	715	668	668
query63	216	189	186	186
query64	4242	975	642	642
query65	4224	4147	4158	4147
query66	1045	413	307	307
query67	15944	15788	15578	15578
query68	6013	879	533	533
query69	474	307	264	264
query70	1231	1141	1109	1109
query71	423	319	307	307
query72	5144	4656	4618	4618
query73	633	561	339	339
query74	8899	9112	8874	8874
query75	3162	3200	2680	2680
query76	3245	1144	726	726
query77	483	360	285	285
query78	10057	10141	9241	9241
query79	1426	816	562	562
query80	1341	527	470	470
query81	516	262	218	218
query82	394	125	96	96
query83	261	276	244	244
query84	243	103	99	99
query85	824	441	474	441
query86	344	314	289	289
query87	4419	4519	4318	4318
query88	2900	2294	2303	2294
query89	372	318	289	289
query90	1724	206	200	200
query91	138	141	109	109
query92	62	58	57	57
query93	1107	924	594	594
query94	667	406	304	304
query95	364	293	357	293
query96	494	555	283	283
query97	2688	2767	2670	2670
query98	219	217	217	217
query99	1294	1434	1256	1256
Total cold run time: 270918 ms
Total hot run time: 186366 ms

doris-robot · 2025-07-16T09:13:40Z

ClickBench: Total hot run time: 32.63 s

machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit e1c9fd0e4bed6efe4326b8215287b7947e2e2186, data reload: false

query1	0.05	0.04	0.04
query2	0.08	0.04	0.04
query3	0.24	0.08	0.07
query4	1.62	0.11	0.11
query5	0.43	0.44	0.43
query6	1.17	0.66	0.65
query7	0.03	0.01	0.02
query8	0.04	0.04	0.03
query9	0.61	0.52	0.52
query10	0.58	0.57	0.58
query11	0.16	0.11	0.11
query12	0.15	0.12	0.11
query13	0.63	0.62	0.61
query14	0.78	0.80	0.82
query15	0.89	0.88	0.86
query16	0.38	0.38	0.40
query17	1.06	1.04	1.06
query18	0.24	0.22	0.22
query19	1.94	1.80	1.80
query20	0.01	0.00	0.01
query21	15.42	0.87	0.54
query22	0.76	1.20	0.77
query23	14.77	1.37	0.62
query24	6.85	1.74	0.94
query25	0.48	0.18	0.15
query26	0.62	0.17	0.13
query27	0.06	0.07	0.05
query28	9.30	0.87	0.44
query29	12.58	3.88	3.22
query30	3.07	3.00	2.98
query31	2.81	0.59	0.37
query32	3.23	0.54	0.47
query33	3.10	3.09	3.22
query34	16.02	5.43	4.80
query35	4.82	4.87	4.84
query36	0.69	0.51	0.48
query37	0.09	0.07	0.07
query38	0.05	0.04	0.04
query39	0.04	0.02	0.02
query40	0.18	0.14	0.14
query41	0.07	0.03	0.02
query42	0.03	0.02	0.03
query43	0.04	0.03	0.03
Total cold run time: 106.17 s
Total hot run time: 32.63 s

doris-robot · 2025-07-16T10:07:30Z

BE UT Coverage Report

Increment line coverage 0.00% (0/5) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	57.50% (15723/27344)
Line Coverage	46.13% (140205/303923)
Region Coverage	35.42% (104507/295057)
Branch Coverage	38.03% (46073/121141)

hello-stephen · 2025-07-16T10:15:36Z

BE Regression && UT Coverage Report

Increment line coverage 0.00% (0/5) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	81.33% (21812/26820)
Line Coverage	73.91% (224194/303314)
Region Coverage	61.52% (186241/302715)
Branch Coverage	65.43% (80142/122485)

hello-stephen · 2025-07-16T10:41:03Z

BE Regression && UT Coverage Report

Increment line coverage 0.00% (0/5) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	81.33% (21812/26820)
Line Coverage	73.91% (224194/303314)
Region Coverage	61.52% (186241/302715)
Branch Coverage	65.43% (80142/122485)

suxiaogang223

LGTM

…r line delimiter (#53374) Multiple concurrent split file locations will be determined in plan phase, if the split point happens to be in the middle of the multi char line delimiter: - The previous concurrent will read the complete row1 and read a little more to read the line delimiter. - The latter concurrency will start reading from half of the multi char line delimiter, and row2 is the first line of this concurrency, but the first line in the middle range is always discarded, so row2 will be lost.

…ng multi char line delimiter #53374 (#53424) Cherry-picked from #53374 Co-authored-by: hui lai <laihui@selectdb.com>

…r line delimiter (apache#53374) Multiple concurrent split file locations will be determined in plan phase, if the split point happens to be in the middle of the multi char line delimiter: - The previous concurrent will read the complete row1 and read a little more to read the line delimiter. - The latter concurrency will start reading from half of the multi char line delimiter, and row2 is the first line of this concurrency, but the first line in the middle range is always discarded, so row2 will be lost.

…ng multi char line delimiter (#53374) (#53634) pick (#53374) Multiple concurrent split file locations will be determined in plan phase, if the split point happens to be in the middle of the multi char line delimiter: - The previous concurrent will read the complete row1 and read a little more to read the line delimiter. - The latter concurrency will start reading from half of the multi char line delimiter, and row2 is the first line of this concurrency, but the first line in the middle range is always discarded, so row2 will be lost. ### What problem does this PR solve? Issue Number: close #xxx Related PR: #xxx Problem Summary: ### Release note None ### Check List (For Author) - Test  - [ ] Regression test - [ ] Unit Test - [ ] Manual test (add detailed scripts or steps below) - [ ] No need to test or manual test. Explain why: - [ ] This is a refactor/code format and no logic has been changed. - [ ] Previous test can cover this change. - [ ] No code files have been changed. - [ ] Other reason  - Behavior changed: - [ ] No. - [ ] Yes.  - Does this need documentation? - [ ] No. - [ ] Yes.  ### Check List (For Reviewer who merge this PR) - [ ] Confirm the release note - [ ] Confirm test cases - [ ] Confirm document - [ ] Add branch pick label

…r line delimiter (apache#53374) Multiple concurrent split file locations will be determined in plan phase, if the split point happens to be in the middle of the multi char line delimiter: - The previous concurrent will read the complete row1 and read a little more to read the line delimiter. - The latter concurrency will start reading from half of the multi char line delimiter, and row2 is the first line of this concurrency, but the first line in the middle range is always discarded, so row2 will be lost.

…ng multi char line delimiter (#53374) (#53635) pick (#53374) Multiple concurrent split file locations will be determined in plan phase, if the split point happens to be in the middle of the multi char line delimiter: - The previous concurrent will read the complete row1 and read a little more to read the line delimiter. - The latter concurrency will start reading from half of the multi char line delimiter, and row2 is the first line of this concurrency, but the first line in the middle range is always discarded, so row2 will be lost. ### What problem does this PR solve? Issue Number: close #xxx Related PR: #xxx Problem Summary: ### Release note None ### Check List (For Author) - Test  - [ ] Regression test - [ ] Unit Test - [ ] Manual test (add detailed scripts or steps below) - [ ] No need to test or manual test. Explain why: - [ ] This is a refactor/code format and no logic has been changed. - [ ] Previous test can cover this change. - [ ] No code files have been changed. - [ ] Other reason  - Behavior changed: - [ ] No. - [ ] Yes.  - Does this need documentation? - [ ] No. - [ ] Yes.  ### Check List (For Reviewer who merge this PR) - [ ] Confirm the release note - [ ] Confirm test cases - [ ] Confirm document - [ ] Add branch pick label

…r line delimiter (apache#53374) (apache#4199) pick [(apache#53374)](apache#53374) Multiple concurrent split file locations will be determined in plan phase, if the split point happens to be in the middle of the multi char line delimiter: - The previous concurrent will read the complete row1 and read a little more to read the line delimiter. - The latter concurrency will start reading from half of the multi char line delimiter, and row2 is the first line of this concurrency, but the first line in the middle range is always discarded, so row2 will be lost. ## Proposed changes Issue Number: close #xxx

sollhui force-pushed the fix_csv_reader_loss_data branch from c3d2310 to 6390b03 Compare July 16, 2025 07:33

liaoxin01 reviewed Jul 16, 2025

View reviewed changes

be/src/vec/exec/format/csv/csv_reader.cpp Show resolved Hide resolved

sollhui force-pushed the fix_csv_reader_loss_data branch 2 times, most recently from bdd269f to 8689240 Compare July 16, 2025 08:13

fix data loss when concurrency read using multi char line delimiter

e1c9fd0

sollhui force-pushed the fix_csv_reader_loss_data branch from 8689240 to e1c9fd0 Compare July 16, 2025 08:17

liaoxin01 added dev/2.1.x dev/3.0.x dev/3.1.x labels Jul 16, 2025

liaoxin01 approved these changes Jul 16, 2025

View reviewed changes

github-actions bot added the approved Indicates a PR has been approved by one committer. label Jul 16, 2025

github-actions bot added the reviewed label Jul 16, 2025

dataroaring added the usercase Important user case type label label Jul 16, 2025

suxiaogang223 approved these changes Jul 17, 2025

View reviewed changes

liaoxin01 merged commit fd6b132 into apache:master Jul 17, 2025
28 of 32 checks passed

github-actions bot added dev/3.0.x-conflict dev/2.1.x-conflict labels Jul 17, 2025

github-actions bot mentioned this pull request Jul 17, 2025

branch-3.1: [fix](csv reader) fix data loss when concurrency read using multi char line delimiter #53374 #53424

Merged

morrySnow pushed a commit that referenced this pull request Jul 17, 2025

branch-3.1: [fix](csv reader) fix data loss when concurrency read usi…

eca9af8

…ng multi char line delimiter #53374 (#53424) Cherry-picked from #53374 Co-authored-by: hui lai <laihui@selectdb.com>

morrySnow added dev/3.1.0-merged and removed dev/3.1.x labels Jul 17, 2025

This was referenced Jul 21, 2025

branch-3.0: [fix](csv reader) fix data loss when concurrency read using multi char line delimiter (#53374) #53634

Merged

branch-2.1: [fix](csv reader) fix data loss when concurrency read using multi char line delimiter (#53374) #53635

Merged

dataroaring added dev/3.0.7-merged and removed dev/3.0.x dev/3.0.x-conflict labels Jul 21, 2025

yiguolei added dev/2.1.11-merged and removed dev/2.1.x dev/2.1.x-conflict labels Jul 26, 2025

gavinchou mentioned this pull request Aug 15, 2025

Release Note 3.0.7 #54868

Open

[fix](csv reader) fix data loss when concurrency read using multi char line delimiter #53374

[fix](csv reader) fix data loss when concurrency read using multi char line delimiter #53374

Uh oh!

Conversation

sollhui commented Jul 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

Background

Root Cause

Solution

Release note

Check List (For Author)

Check List (For Reviewer who merge this PR)

Uh oh!

Thearas commented Jul 16, 2025

Uh oh!

sollhui commented Jul 16, 2025

Uh oh!

sollhui commented Jul 16, 2025

Uh oh!

Uh oh!

sollhui commented Jul 16, 2025

Uh oh!

sollhui commented Jul 16, 2025

Uh oh!

liaoxin01 left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Jul 16, 2025

Uh oh!

github-actions bot commented Jul 16, 2025

Uh oh!

doris-robot commented Jul 16, 2025

Uh oh!

doris-robot commented Jul 16, 2025

Uh oh!

doris-robot commented Jul 16, 2025

Uh oh!

doris-robot commented Jul 16, 2025

BE UT Coverage Report

Uh oh!

hello-stephen commented Jul 16, 2025

BE Regression && UT Coverage Report

Uh oh!

hello-stephen commented Jul 16, 2025

BE Regression && UT Coverage Report

Uh oh!

suxiaogang223 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

9 participants

sollhui commented Jul 16, 2025 •

edited

Loading