Custom utility to convert parquet filters to pyarrow expression#9885
Custom utility to convert parquet filters to pyarrow expression#9885
Conversation
jrbourbeau
left a comment
There was a problem hiding this comment.
Thanks @rjzamora
@jorisvandenbossche does this look like something that you'd like to see mirrored on the pyarrow side? (xref #9845 (comment))
| return bool(filtered_cols - partition_cols) | ||
|
|
||
|
|
||
| def _filters_to_expression(filters, propagate_null=False, nan_is_null=True): |
There was a problem hiding this comment.
I see why, but it's unfortunate that we need to patch a function this complex. Is there a way we could only use the patched version of filters_to_expression if we're applying a filter that needs patching?
There was a problem hiding this comment.
Yes, we could probably flatten the filter and look for "is", "is not", "!=", or "not in". However, I don't think that does much to simplify the patch (nor does it make the code easier to understand or maintain).
There was a problem hiding this comment.
Yeah, totally agree it won't make the patch any smaller. I was mostly concerned about drift between this patch and the corresponding function upstream in pyarrow. Though maybe we have sufficient test coverage to catch any drift in our upstream build before a pyarrow release happens
There was a problem hiding this comment.
I agree that drift is a concern. We had a similar plan for write_to_dataset three years ago, and divergence in supported file-naming options is now a new blocker (see: #9968).
Overall, I'm hopeful that pyarrow will agree with the proposed changes here (or adopt something similar that we can use). However, I'm pretty sure dask is the primary consumer of their filters_to_expression utility, and so I wouldn't consider it a tragedy if dask needed to take ownership of this logic longer-term.
There was a problem hiding this comment.
Okay, sounds good. I'll defer to your judgement here. Based on #9845 (comment) for @jorisvandenbossche it sounds like this sort of change would be welcome on the pyarrow side
|
Note that I'll be happy to follow up with revisions to these changes if pyarrow decides to resolve the null-value problem in a different way. |
Possible solution for the filtering issues reported in #9845
pre-commit run --all-files