GH-39857: [C++] Improve error message for "chunker out of sync" condition by pitrou · Pull Request #39892 · apache/arrow

pitrou · 2024-02-01T18:04:29Z

Rationale for this change

When writing the CSV reader, we thought that the parser not finding the same line limits as the chunker should never happen, hence the terse "chunker out of sync" error message.

It turns out that, if the input contains multiline cell values and the newlines_in_values option was not enabled, the chunker can happily delimit a block on a newline that's inside a quoted string. The parser will then see truncated data and will stop parsing, yielding a parsed size that's smaller than the first block (see added comment in the code).
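The mismatch described above can be reproduced with a small model. This is a simplified sketch, not Arrow's actual chunker or parser: the function names `naive_chunk_end` and `quote_aware_parsed_size` are hypothetical, and it assumes `"` as the quote character and `\n` as the only line ending.

```python
# Simplified model of the "chunker out of sync" condition.
# These helpers are illustrative assumptions, not Arrow APIs.

QUOTE = ord('"')
NEWLINE = ord("\n")


def naive_chunk_end(block: bytes) -> int:
    """Chunker without newline-in-value support: cut after the last
    newline in the block, ignoring whether it sits inside quotes."""
    idx = block.rfind(b"\n")
    return idx + 1 if idx >= 0 else 0


def quote_aware_parsed_size(data: bytes) -> int:
    """Parser model: return the number of bytes covered by complete
    rows, treating newlines inside quoted values as cell content."""
    in_quotes = False
    last_row_end = 0
    for i, byte in enumerate(data):
        if byte == QUOTE:
            # An escaped quote ("") toggles twice, leaving the state unchanged.
            in_quotes = not in_quotes
        elif byte == NEWLINE and not in_quotes:
            last_row_end = i + 1
    return last_row_end


# Second row holds a quoted cell containing a newline.
DATA = b'a,b\n1,"x\ny"\n2,z\n'

# Pretend the reader's first block is the first 10 bytes of the input.
first_block = DATA[:10]
chunk_size = naive_chunk_end(first_block)               # cuts inside the quoted cell
parsed_size = quote_aware_parsed_size(DATA[:chunk_size])  # stops at the last complete row
```

Here `chunk_size` is 9 while `parsed_size` is 4: the parser consumes less than the chunker delimited, which is the situation the improved error message now explains. Using the quote-aware scan to delimit the block instead (the rough equivalent of enabling `newlines_in_values`) gives 4 for both.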

What changes are included in this PR?

Add some parser tests that showcase the condition encountered in [R] CSV parser got out of sync with chunker #39857
Improve error message to guide users towards the solution

Are these changes tested?

There's no functional change, the error message itself isn't tested.

Are there any user-facing changes?

No.

Closes: [R] CSV parser got out of sync with chunker #39857

github-actions · 2024-02-01T18:04:57Z

⚠️ GitHub issue #39857 has been automatically assigned in GitHub to PR creator.

bkietz · 2024-02-02T15:37:18Z

cpp/src/arrow/csv/parser_test.cc

conbench-apache-arrow · 2024-02-07T02:34:03Z

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit a6e577d.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them.

… condition (apache#39892)

### Rationale for this change

When writing the CSV reader, we thought that the parser not finding the same line limits as the chunker should never happen, hence the terse "chunker out of sync" error message.

It turns out that, if the input contains multiline cell values and the `newlines_in_values` option was not enabled, the chunker can happily delimit a block on a newline that's inside a quoted string. The parser will then see truncated data and will stop parsing, yielding a parsed size that's smaller than the first block (see added comment in the code).

### What changes are included in this PR?

* Add some parser tests that showcase the condition encountered in apacheGH-39857
* Improve error message to guide users towards the solution

### Are these changes tested?

There's no functional change, the error message itself isn't tested.

### Are there any user-facing changes?

No.

* Closes: apache#39857

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

github-actions bot added the Component: C++ and awaiting review labels Feb 1, 2024

pitrou requested a review from bkietz February 1, 2024 18:11

bkietz requested changes Feb 2, 2024

cpp/src/arrow/csv/parser_test.cc Outdated

Member

bkietz Feb 2, 2024

o_O

cpp/src/arrow/csv/parser_test.cc Outdated

github-actions bot added the awaiting changes label and removed the awaiting review label Feb 2, 2024

pitrou force-pushed the gh39857-csv-chunker-out-of-sync branch from b5a5b51 to f0e4fbd Compare February 5, 2024 16:09

github-actions bot added the awaiting change review and awaiting changes labels and removed the awaiting changes and awaiting change review labels Feb 5, 2024

pitrou force-pushed the gh39857-csv-chunker-out-of-sync branch from f0e4fbd to 72c2695 Compare February 5, 2024 17:22

github-actions bot added the Component: Python and awaiting change review labels and removed the awaiting changes label Feb 5, 2024

bkietz approved these changes Feb 5, 2024

cpp/src/arrow/csv/parser_test.cc Outdated

github-actions bot added the awaiting merge label and removed the awaiting change review label Feb 5, 2024

pitrou added 2 commits February 6, 2024 14:58

apacheGH-39857: [C++] Improve error message for "chunker out of sync"…

7d03352

… condition

Improve comments

9a704c6

pitrou force-pushed the gh39857-csv-chunker-out-of-sync branch from 72c2695 to 9a704c6 Compare February 6, 2024 14:06

pitrou merged commit a6e577d into apache:main Feb 6, 2024

pitrou removed the awaiting merge label Feb 6, 2024

pitrou mentioned this pull request Feb 6, 2024

[C++] Small CSV reader refactoring #39962

Closed

pitrou deleted the gh39857-csv-chunker-out-of-sync branch February 6, 2024 15:22

pitrou merged 2 commits into apache:main from pitrou:gh39857-csv-chunker-out-of-sync
