Skip to content

Adjust the Parquet engine classes to allow more easily subclassing#6211

Merged
martindurant merged 4 commits intodask:masterfrom
mariusvniekerk:refactor-pyarrow-read
May 19, 2020
Merged

Adjust the Parquet engine classes to allow more easily subclassing#6211
martindurant merged 4 commits intodask:masterfrom
mariusvniekerk:refactor-pyarrow-read

Conversation

@mariusvniekerk
Copy link
Contributor

Reduces code duplication when subclassing and provides a hook for adjusting arrow / pandas conversion behavior more precisely

Add passthough for arrow_to_pandas options.

  • Tests added / passed
  • Passes black dask / flake8 dask

@mariusvniekerk
Copy link
Contributor Author

mariusvniekerk commented May 15, 2020

Background for this is that I need a place to put a hook to deal with data that has weird fat fingered parquet datetimes which cause issues when converting into pandas.

Turns out that 3019-01-05 00:00:00 is not a timestamp that pandas appreciates :D

Having some other similar splitting points in some of the other larger functions might make them a bit less scary

@mariusvniekerk mariusvniekerk force-pushed the refactor-pyarrow-read branch from 26583f9 to 29bd77f Compare May 15, 2020 13:50
@martindurant
Copy link
Member

Can you add a test specifically with your "fat-fingered" timestamp that would have caused pandas overflow? That is your specific use-case here, so it also serves to show why this change is useful.

without code duplication.

Add passthough for arrow_to_pandas options.
@mariusvniekerk mariusvniekerk force-pushed the refactor-pyarrow-read branch from 29bd77f to b9db240 Compare May 15, 2020 14:24
@martindurant
Copy link
Member

Given that these methods are only called explicitly from our own code, I have no problem with these changes. @rjzamora , do you have any objections or thoughts?

@rjzamora
Copy link
Member

Thanks for contributing here @mariusvniekerk !

Only had a chance to take a quick look on my phone, but I suspect the changes are fine. Since we do use these methods in dask_cudf, I would like to double check that this doesn't break anything there and report back.

@mariusvniekerk
Copy link
Contributor Author

Thanks @rjzamora , This should also allow you to be more specific with other fancy stuff you want to do in dask_cudf.

Copy link
Member

@rjzamora rjzamora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes seem reasonable to me - Thanks @mariusvniekerk

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants