Skip to content

[Python] Add support for "is" and "is not" to pyarrow.parquet.filters_to_expression #34504

@rjzamora

Description

@rjzamora

Describe the enhancement requested

As discussed in dask/dask#9845, it is not currently possible to pass a List[Tuple]-formatted filter to filters_to_expression that can be converted into an expression that will filter null/nan values correctly. This is because something like [("a", "!=", np.nan) will result in <pyarrow.compute.Expression (a != nan)> instead of <pyarrow.compute.Expression invert(is_null(a, {nan_is_null=true}))>.

@j-bennet suggested that we add support for "is" and "is not" operators for this, and @jorisvandenbossche seemed receptive to this idea.

Since this was a blocking issue for some RAPIDS users, dask has temporarily added its own custom implementation of filters_to_expression. Is there any interest in adopting a similar version of this utility in pyarrow?

Note: It would be nice if the filters_to_expression utility made it possible to mimic pandas behavior and avoid null propagation for predicates like [("a", "!=", 1)] (by taking such a predicate to mean <pyarrow.compute.Expression (is_null(a, {nan_is_null=true}) or (a != 1))> on the arrow side). Dask's custom implementation adds the propagate_null argument for this purpose.

Component(s)

Parquet, Python

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions