Skip to content

Conversation

@brunal
Copy link
Contributor

@brunal brunal commented Jul 4, 2025

Reverts #16342

After that change, one cannot write an empty RecordBatch with a schema to parquet anymore.

Indeed, the logic added tries to write an empty recordbatch when num_rows == 0.

So if you tried to write an empty recordbatch, you end up writing it twice => it fails.

@github-actions github-actions bot added core Core DataFusion crate datasource Changes to the datasource crate labels Jul 4, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @brunal

Can you please also add a test showing the problem you encountered (aka a test for writing an empty record batch stream to a parquet file) so we don't accidentally reintroduce the same problem?

cc @chenkovsky

@alamb
Copy link
Contributor

alamb commented Jul 7, 2025

Looks like there may be a test in

brunal@aed5316

@brunal brunal force-pushed the revert-16342-fix/empty_stream branch from a4dc97c to f5a77b1 Compare July 7, 2025 18:58
@brunal
Copy link
Contributor Author

brunal commented Jul 7, 2025

Indeed! I have just updated the revert commit with this test.

@alamb
Copy link
Contributor

alamb commented Jul 7, 2025

The CI tests seem to have some problems unfortunately

@brunal brunal force-pushed the revert-16342-fix/empty_stream branch from f5a77b1 to f93a15c Compare July 8, 2025 06:26
This reverts commit 4084894.

It adds a test showcasing the functionality that the commit above broke:
writing a parquet file from an empty RecordBatch.
@brunal brunal force-pushed the revert-16342-fix/empty_stream branch from f93a15c to ab5e7a3 Compare July 8, 2025 06:40
@brunal
Copy link
Contributor Author

brunal commented Jul 8, 2025

Apologies for that; it's now fixed.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @brunal

I pushed a commit to also verify the schema of the batch that is read from the empty parquet file

@chenkovsky can you help me review this PR and maybe help find a fix that works with csv and other formats?

@chenkovsky
Copy link
Contributor

@alamb @brunal sorry for inconvenience, revert it is ok. I'm trying to find a new way to implement this feature.

@alamb
Copy link
Contributor

alamb commented Jul 8, 2025

@alamb @brunal sorry for inconvenience, revert it is ok. I'm trying to find a new way to implement this feature.

Thank you @brunal and @chenkovsky

@alamb alamb merged commit 33be09a into apache:main Jul 8, 2025
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate datasource Changes to the datasource crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants