
Conversation


@sollhui sollhui commented Jul 21, 2025

pick (#53374)

The file split locations for multiple concurrent readers are determined in the plan phase. If a split point happens to fall in the middle of a multi-character line delimiter:

  • The previous reader reads the complete row1, then reads a little further to consume the rest of the line delimiter.
  • The next reader starts from the middle of the multi-character line delimiter, so row2 is the first line it sees; but the first line of a middle range is always discarded (it is assumed to be a partial line already consumed by the previous reader), so row2 is lost.
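
The failure mode can be sketched with a tiny simulation. This is a hedged illustration, not the actual Doris CSV reader: `read_range`, the `\r\n` delimiter, and the split offset are all assumptions made for the example.

```python
def read_range(data: bytes, start: int, end: int, delim: bytes) -> list[bytes]:
    """Read lines whose start falls in [start, end).

    A reader may run past `end` to finish a line it has already started
    (this is how the previous reader gets the complete row1). A middle
    range (start > 0) discards everything up to and including the first
    delimiter, assuming it is the tail of a line the previous reader read.
    """
    lines = []
    pos = start
    if start > 0:
        nl = data.find(delim, start)
        if nl == -1:
            return []
        pos = nl + len(delim)  # skip the (assumed partial) first line
    while pos < end:
        nl = data.find(delim, pos)
        if nl == -1:
            lines.append(data[pos:])
            break
        lines.append(data[pos:nl])
        pos = nl + len(delim)
    return lines

data = b"row1\r\nrow2\r\nrow3\r\n"
# Split point at offset 5 falls between '\r' and '\n': mid-delimiter.
split = 5
first = read_range(data, 0, split, b"\r\n")
second = read_range(data, split, len(data), b"\r\n")
print(first)   # [b'row1']
print(second)  # [b'row3'] -- row2 is lost: the range starts on the
               # trailing '\n', so the "skip first partial line" step
               # jumps past row2's real start to the next delimiter.
```

With a single-byte delimiter the split can never land mid-delimiter, which is why the bug only surfaces with multi-character line delimiters.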

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@sollhui sollhui requested a review from dataroaring as a code owner July 21, 2025 06:46
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (ideally include the specific error message) and how it was fixed.
  2. Which behaviors were modified: what the previous behavior was, what it is now, why it was changed, and what impacts the change may have.
  3. What features were added and why.
  4. Which code was refactored and why.
  5. Which functions were optimized and what the difference is before and after the optimization.

@sollhui sollhui changed the title branch-3.1: [fix](csv reader) fix data loss when concurrency read using multi char line delimiter (#53374) branch-3.0: [fix](csv reader) fix data loss when concurrency read using multi char line delimiter (#53374) Jul 21, 2025
@sollhui
Contributor Author

sollhui commented Jul 21, 2025

run buildall

@doris-robot

BE UT Coverage Report

Increment line coverage 0.00% (0/5) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 41.89% (11135/26579) |
| Line Coverage | 32.44% (95397/294098) |
| Region Coverage | 31.57% (49264/156048) |
| Branch Coverage | 27.99% (25241/90184) |

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 0.00% (0/5) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 50.87% (13344/26233) |
| Line Coverage | 41.34% (121417/293702) |
| Region Coverage | 38.97% (70430/180717) |
| Branch Coverage | 33.37% (34018/101948) |

Contributor

@dataroaring dataroaring left a comment


LGTM

@dataroaring dataroaring merged commit 6e07b92 into apache:branch-3.0 Jul 21, 2025
20 of 25 checks passed