Skip to content

GH-39857: [C++] Improve error message for "chunker out of sync" condition#39892

Merged
pitrou merged 2 commits intoapache:mainfrom
pitrou:gh39857-csv-chunker-out-of-sync
Feb 6, 2024
Merged

GH-39857: [C++] Improve error message for "chunker out of sync" condition#39892
pitrou merged 2 commits intoapache:mainfrom
pitrou:gh39857-csv-chunker-out-of-sync

Conversation

@pitrou
Copy link
Copy Markdown
Member

@pitrou pitrou commented Feb 1, 2024

Rationale for this change

When writing the CSV reader, we thought that the parser not finding the same line limits as the chunker should never happen, hence the terse "chunker out of sync" error message.

It turns out that, if the input contains multiline cell values and the newlines_in_values option was not enabled, the chunker can happily delimit a block on a newline that's inside a quoted string. The parser will then see truncated data and will stop parsing, yielding a parsed size that's smaller than the first block (see added comment in the code).

What changes are included in this PR?

Are these changes tested?

There's no functional change, the error message itself isn't tested.

Are there any user-facing changes?

No.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Feb 1, 2024

⚠️ GitHub issue #39857 has been automatically assigned in GitHub to PR creator.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

o_O

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Feb 2, 2024
@pitrou pitrou force-pushed the gh39857-csv-chunker-out-of-sync branch from b5a5b51 to f0e4fbd Compare February 5, 2024 16:09
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Feb 5, 2024
@pitrou pitrou force-pushed the gh39857-csv-chunker-out-of-sync branch from f0e4fbd to 72c2695 Compare February 5, 2024 17:22
@github-actions github-actions bot added Component: Python awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Feb 5, 2024
@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Feb 5, 2024
@pitrou pitrou force-pushed the gh39857-csv-chunker-out-of-sync branch from 72c2695 to 9a704c6 Compare February 6, 2024 14:06
@pitrou pitrou merged commit a6e577d into apache:main Feb 6, 2024
@pitrou pitrou removed the awaiting merge Awaiting merge label Feb 6, 2024
@pitrou pitrou deleted the gh39857-csv-chunker-out-of-sync branch February 6, 2024 15:22
@conbench-apache-arrow
Copy link
Copy Markdown

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit a6e577d.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them.

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
… condition (apache#39892)

### Rationale for this change

When writing the CSV reader, we thought that the parser not finding the same line limits as the chunker should never happen, hence the terse "chunker out of sync" error message.

It turns out that, if the input contains multiline cell values and the `newlines_in_values` option was not enabled, the chunker can happily delimit a block on a newline that's inside a quoted string. The parser will then see truncated data and will stop parsing, yielding a parsed size that's smaller than the first block (see added comment in the code).

### What changes are included in this PR?

* Add some parser tests that showcase the condition encountered in apacheGH-39857
* Improve error message to guide users towards the solution

### Are these changes tested?

There's no functional change, the error message itself isn't tested.

### Are there any user-facing changes?

No.

* Closes: apache#39857

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[R] CSV parser got out of sync with chunker

2 participants