Custom utility to convert parquet filters to pyarrow expression by rjzamora · Pull Request #9885 · dask/dask

rjzamora · 2023-01-27T03:55:49Z

Possible solution for the filtering issues reported in #9845

Closes read_parquet filter bug with nulls #9845
Tests added / passed
Passes pre-commit run --all-files


jrbourbeau

@jorisvandenbossche does this look like something that you'd like to see mirrored on the pyarrow side? (xref #9845 (comment))

dask/dataframe/io/parquet/utils.py

dask/dataframe/io/tests/test_parquet.py

jrbourbeau · 2023-03-06T22:53:41Z

dask/dataframe/io/parquet/arrow.py

    return bool(filtered_cols - partition_cols)


+def _filters_to_expression(filters, propagate_null=False, nan_is_null=True):


I see why, but it's unfortunate that we need to patch a function this complex. Is there a way we could only use the patched version of filters_to_expression if we're applying a filter that needs patching?

rjzamora · Mar 7, 2023

Yes, we could probably flatten the filter and look for "is", "is not", "!=", or "not in". However, I don't think that does much to simplify the patch (nor does it make the code easier to understand or maintain).

jrbourbeau · Mar 7, 2023

Yeah, totally agree it won't make the patch any smaller. I was mostly concerned about drift between this patch and the corresponding function upstream in pyarrow. Though maybe we have sufficient test coverage to catch any drift in our upstream build before a pyarrow release happens.

rjzamora · Mar 7, 2023

I agree that drift is a concern. We had a similar plan for write_to_dataset three years ago, and divergence in supported file-naming options is now a new blocker (see: #9968).

Overall, I'm hopeful that pyarrow will agree with the proposed changes here (or adopt something similar that we can use). However, I'm pretty sure dask is the primary consumer of their filters_to_expression utility, and so I wouldn't consider it a tragedy if dask needed to take ownership of this logic longer-term.

jrbourbeau · Mar 7, 2023

Okay, sounds good. I'll defer to your judgement here. Based on #9845 (comment) from @jorisvandenbossche, it sounds like this sort of change would be welcome on the pyarrow side.
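The idea raised in this thread, flattening the filter and checking for "is", "is not", "!=", or "not in", could look roughly like the sketch below. The helper names `flatten_filters` and `needs_patched_conversion` are hypothetical and are not part of dask or pyarrow. The sketch assumes filters are given either as a flat list of `(column, op, value)` tuples or as a list of such lists. As noted above, this check was not pursued, since it would not make the patch any smaller.

```python
# Operators named in the review thread as needing the patched null handling.
NULL_SENSITIVE_OPS = {"is", "is not", "!=", "not in"}


def flatten_filters(filters):
    """Yield every (column, op, value) predicate from a filter.

    Accepts a flat list of tuples (one conjunction) or a list of
    lists of tuples (a disjunction of conjunctions).
    """
    for item in filters:
        if isinstance(item, tuple):
            yield item
        else:
            # Nested list: recurse into the inner conjunction.
            yield from flatten_filters(item)


def needs_patched_conversion(filters):
    """Return True if any predicate uses a null-sensitive operator."""
    return any(op in NULL_SENSITIVE_OPS for _, op, _ in flatten_filters(filters))
```

With a helper like this, a caller could fall back to the upstream conversion whenever `needs_patched_conversion(filters)` is false, at the cost of maintaining two code paths.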

rjzamora · 2023-03-07T19:50:22Z

Note that I'll be happy to follow up with revisions to these changes if pyarrow decides to resolve the null-value problem in a different way.
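To make the null-value problem concrete, here is a small pure-Python model of filter evaluation over rows with missing values. It does not use pyarrow and is not the code from this PR. The function names are hypothetical, and the meaning given to `propagate_null` and `nan_is_null` is an assumption inferred from the signature shown in the diff: with `propagate_null=False`, rows whose value is null are kept by "!=" and "not in" instead of being silently dropped.

```python
import math
import operator

_COMPARISONS = {
    "=": operator.eq, "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge,
    "in": lambda x, vals: x in vals,
    "not in": lambda x, vals: x not in vals,
}


def _is_null(x, nan_is_null=True):
    """Treat None as null, and optionally NaN as well."""
    if x is None:
        return True
    return nan_is_null and isinstance(x, float) and math.isnan(x)


def eval_predicate(row, col, op, val, propagate_null=False, nan_is_null=True):
    """Evaluate one (col, op, val) predicate against a dict-like row."""
    x = row.get(col)
    null = _is_null(x, nan_is_null)
    if op == "is":
        # Explicit null test, e.g. ("a", "is", None); val is not validated here.
        return null
    if op == "is not":
        return not null
    if null:
        # Comparing against a null has no definite answer.
        # Assumption: propagate_null=False keeps such rows for "!=" / "not in".
        return (not propagate_null) and op in ("!=", "not in")
    return _COMPARISONS[op](x, val)


def apply_filters(rows, filters, **kwargs):
    """Keep rows matching any conjunction in a list of lists of predicates."""
    return [
        row for row in rows
        if any(all(eval_predicate(row, *pred, **kwargs) for pred in conj)
               for conj in filters)
    ]
```

For example, with rows holding the values 1, 2, None, and NaN in column "a", the filter `[[("a", "!=", 1)]]` keeps three rows under the defaults but only the row with 2 when `propagate_null=True`.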

rjzamora added 2 commits January 26, 2023 19:48

add test coverage and fix filter edge cases

531486d

add comment on copied code

1cb4ce7

github-actions bot added dataframe io labels Jan 27, 2023

rjzamora added the parquet label Jan 27, 2023

rjzamora added 2 commits January 27, 2023 06:46

avoid using pq.core._check_filters

c571966

require pyarrow>=6.0.0 to use nan_is_null

2f13b37

rjzamora mentioned this pull request Feb 6, 2023

read_parquet filter bug with nulls #9845

Closed

rjzamora added 4 commits February 26, 2023 07:29

Merge remote-tracking branch 'upstream/main' into custom-filters-to-expression

5110fe1

fix append bug

a681959

Merge remote-tracking branch 'upstream/main' into custom-filters-to-expression

d921073

move to 'is' and 'is not' for null comparison

febcaa4

rjzamora marked this pull request as ready for review March 6, 2023 19:01

fix comment

8bd8908

rjzamora mentioned this pull request Mar 6, 2023

Increase minimum supported pyarrow to 7.0 #10024

Merged

jrbourbeau reviewed Mar 6, 2023


rjzamora added 2 commits March 6, 2023 21:18

revisions

ccefda7

simplify test

579cb1d

jrbourbeau mentioned this pull request Mar 7, 2023

Release 2023.3.1 dask/community#312

Closed

5 tasks

rjzamora merged commit 5fc9b90 into dask:main Mar 7, 2023

rjzamora deleted the custom-filters-to-expression branch March 7, 2023 19:49

phofl mentioned this pull request Aug 12, 2024

Pyarrow <NA> filters are not being applied in read_parquet #11235

Closed

rjzamora merged 11 commits into dask:main from rjzamora:custom-filters-to-expression
