
Bugfix for parquet metadata writes of empty dataframe partitions (pyarrow) #6741

Merged
martindurant merged 6 commits into dask:master from callumanoble:bugfix/to-parquet-write-empty-metadata
Oct 27, 2020


@callumanoble
Contributor

Attempting to write metadata for empty dataframe partitions with to_parquet (pyarrow impl.) can raise unintended exceptions or cause segfaults.

This PR filters out the None metadata elements generated by write_partition that cause these issues. Unit test cases demonstrating the current issue are included.

More details in the following comment #6600 (comment)
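As a rough illustration of the filtering idea described above, here is a minimal, dependency-free sketch. It does not reproduce the actual dask code: the function name `drop_empty_partition_metadata` and the assumed shape of each partition result (a one-element list holding a dict with a "meta" key that is None for an empty partition) are hypothetical and chosen only for illustration.

```python
from typing import Any, Dict, List, Optional

# Assumed (hypothetical) shape: each partition write returns a one-element
# list holding a dict whose "meta" value is None when the partition was empty.
PartResult = List[Dict[str, Optional[Any]]]


def drop_empty_partition_metadata(parts: List[PartResult]) -> List[PartResult]:
    """Keep only partition results that carry real metadata."""
    return [p for p in parts if p and p[0].get("meta") is not None]


# Strings stand in for real per-file metadata objects.
parts = [
    [{"meta": "metadata-for-part-0"}],
    [{"meta": None}],  # empty partition: nothing to record
    [{"meta": "metadata-for-part-2"}],
]

kept = drop_empty_partition_metadata(parts)
print(len(kept))
```

Filtering before the metadata is aggregated means the code that combines per-partition metadata never has to handle a None entry, which is the failure mode the description attributes to empty partitions.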

Callum Noble added 2 commits October 16, 2020 00:21

def test_parquet_pyarrow_write_empty_metadata(tmpdir):
    # https://github.com/dask/dask/issues/6600
    dd = pytest.importorskip("dask.dataframe")
Member

@rjzamora rjzamora Oct 16, 2020


I believe dask.dataframe is already imported as dd in the header, so you shouldn't need this in either of the new tests.

(same with pd)

Contributor Author


Thanks, this was left over after moving the tests from a different file (where the import was required).

Callum Noble added 2 commits October 16, 2020 20:34
UTs homogenous typing for all builds
Member

@rjzamora rjzamora left a comment


Thanks for this fix @mcguipat - My only suggestion is to allow the default scheduler in the tests (but that is not critical).

@callumanoble callumanoble requested a review from rjzamora October 21, 2020 23:00
@martindurant martindurant merged commit 405fd8c into dask:master Oct 27, 2020
@martindurant
Member

Thanks for the review @rjzamora

@callumanoble callumanoble deleted the bugfix/to-parquet-write-empty-metadata branch October 28, 2020 19:08
kumarprabhu1988 pushed a commit to kumarprabhu1988/dask that referenced this pull request Oct 29, 2020
…rrow) (dask#6741)

* [bugfix/to-parquet-write-empty-metadata] Filter out null entries in pyarrow parquet metadata writes, causes AttributeError/Segfault

* Explicit failure for exception test

* [bugfix/to-parquet-write-empty-metadata] black

* Remove unnecessary imports
UTs homogenous typing for all builds

* Placate 3.9 pre-commit

* Remove unnecessary scheduler specs

Co-authored-by: Callum Noble <C.Noble@mwam.com>

3 participants