Description
Describe the enhancement requested
As discussed in dask/dask#9845, it is not currently possible to pass a List[Tuple]-formatted filter to filters_to_expression that converts into an expression that filters null/NaN values correctly. This is because something like [("a", "!=", np.nan)] will result in <pyarrow.compute.Expression (a != nan)> instead of <pyarrow.compute.Expression invert(is_null(a, {nan_is_null=true}))>.
@j-bennet suggested that we add support for "is" and "is not" operators for this, and @jorisvandenbossche seemed receptive to this idea.
Since this was a blocking issue for some RAPIDS users, dask has temporarily added its own custom implementation of filters_to_expression. Is there any interest in adopting a similar version of this utility in pyarrow?
Note: It would be nice if the filters_to_expression utility made it possible to mimic pandas behavior and avoid null propagation for predicates like [("a", "!=", 1)] (by taking such a predicate to mean <pyarrow.compute.Expression (is_null(a, {nan_is_null=true}) or (a != 1))> on the arrow side). Dask's custom implementation adds the propagate_null argument for this purpose.
Component(s)
Parquet, Python