Skip to content

Custom utility to convert parquet filters to pyarrow expression#9885

Merged
rjzamora merged 11 commits intodask:mainfrom
rjzamora:custom-filters-to-expression
Mar 7, 2023
Merged

Custom utility to convert parquet filters to pyarrow expression#9885
rjzamora merged 11 commits intodask:mainfrom
rjzamora:custom-filters-to-expression

Conversation

@rjzamora
Copy link
Member

@rjzamora rjzamora commented Jan 27, 2023

Possible solution for the filtering issues reported in #9845

@rjzamora rjzamora marked this pull request as ready for review March 6, 2023 19:01
Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rjzamora

@jorisvandenbossche does this look like something that you'd like to see mirrored on the pyarrow side? (xref #9845 (comment))

return bool(filtered_cols - partition_cols)


def _filters_to_expression(filters, propagate_null=False, nan_is_null=True):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see why, but it's unfortunate that we need to patch a function this complex. Is there a way we could only use the patched version of filters_to_expression if we're applying a filter that needs patching?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we could probably flatten the filter and look for "is", "is not", "!=", or "not in". However, I don't think that does much to simplify the patch (nor does it make the code easier to understand or maintain).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, totally agree it won't make the patch any smaller. I was mostly concerned about drift between this patch and the corresponding function upstream in pyarrow. Though maybe we have sufficient test coverage to catch any drift in our upstream build before a pyarrow release happens

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that drift is a concern. We had a similar plan for write_to_dataset three years ago, and divergence in supported file-naming options is now a new blocker (see: #9968).

Overall, I'm hopeful that pyarrow will agree with the proposed changes here (or adopt something similar that we can use). However, I'm pretty sure dask is the primary consumer of their filters_to_expression utility, and so I wouldn't consider it a tragedy if dask needed to take ownership of this logic longer-term.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, sounds good. I'll defer to your judgement here. Based on #9845 (comment) for @jorisvandenbossche it sounds like this sort of change would be welcome on the pyarrow side

@jrbourbeau jrbourbeau mentioned this pull request Mar 7, 2023
5 tasks
@rjzamora rjzamora merged commit 5fc9b90 into dask:main Mar 7, 2023
@rjzamora rjzamora deleted the custom-filters-to-expression branch March 7, 2023 19:49
@rjzamora
Copy link
Member Author

rjzamora commented Mar 7, 2023

Note that I'll be happy to follow up with revisions to these changes if pyarrow decides to resolve the null-value problem in a different way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

read_parquet filter bug with nulls

2 participants