[Data] [2/n] - Add predicate expressions to filter api #56365
Signed-off-by: Goutam V. <goutam@anyscale.com>
Code Review
This pull request introduces support for predicate expressions in Dataset.filter(), a significant enhancement to the Ray Data expression system. The changes are well-implemented, adding new expression types like UnaryExpr and PredicateExpr, expanding the set of supported operations, and integrating them into the evaluation and planning logic. The refactoring of Dataset.filter improves code clarity, and the comprehensive test suite is a great addition. I have one suggestion regarding a test condition that appears to be a typo.
@pytest.mark.skipif(
    get_pyarrow_version() < parse_version("20.0.0"),
    reason="predicate expressions require PyArrow >= 20.0.0",
)
The skipif condition get_pyarrow_version() < parse_version("20.0.0") will cause these tests to be skipped, as the current PyArrow version is much lower than 20.0.0. This appears to be a typo. Could you please verify the required PyArrow version and update this condition? If there's no specific version dependency, this condition should be removed to ensure these important tests are executed.
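To make the concern concrete, here is a minimal sketch of how such a version gate behaves. It is not the helper used in the PR: `version_tuple` and `would_skip` are hypothetical names, the installed versions passed in are made-up examples, and the parser assumes plain `X.Y.Z` release strings with no pre-release suffixes.

```python
def version_tuple(version: str) -> tuple:
    """Parse a plain 'X.Y.Z' release string into a tuple of ints.

    Assumption: no pre-release or local suffixes such as 'rc1' or '+cpu'.
    """
    return tuple(int(part) for part in version.split("."))


def would_skip(installed: str, required: str = "20.0.0") -> bool:
    """Mirror the skipif condition: skip when installed < required."""
    return version_tuple(installed) < version_tuple(required)


# Hypothetical installed versions, chosen only to show both outcomes.
older_is_skipped = would_skip("17.0.0")    # older than the gate, test is skipped
exact_is_skipped = would_skip("20.0.0")    # equal to the gate, test runs
newer_is_skipped = would_skip("21.0.0")    # newer than the gate, test runs
```

If the environment that runs the suite only ever installs a PyArrow release below the gate, every test carrying this marker is skipped there, which is the situation the comment asks the author to rule out.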
Why are these changes needed?
For dataset.filter(), add support for predicate expressions, which are part of Ray Data's expression system and will soon replace the fn and string-based expr that gets evaluated as a PyArrow expression.
Building on top of: #56313
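For readers unfamiliar with the idea, the following is a minimal standalone sketch of what a predicate expression is: operators on column objects build a small tree, and the filter evaluates that tree per row. It does not use Ray and is not Ray Data's implementation. `Col`, `Lit`, `BinaryExpr`, and `filter_rows` are hypothetical names for illustration, and `UnaryExpr` only loosely mirrors the expression type named in the review above.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


class Expr:
    """Base class: operators build an expression tree instead of a value."""

    def __gt__(self, other: Any) -> "BinaryExpr":
        return BinaryExpr(operator.gt, self, _wrap(other))

    def __lt__(self, other: Any) -> "BinaryExpr":
        return BinaryExpr(operator.lt, self, _wrap(other))

    def __and__(self, other: Any) -> "BinaryExpr":
        return BinaryExpr(operator.and_, self, _wrap(other))

    def __or__(self, other: Any) -> "BinaryExpr":
        return BinaryExpr(operator.or_, self, _wrap(other))

    def __invert__(self) -> "UnaryExpr":
        return UnaryExpr(operator.not_, self)


@dataclass(eq=False)
class Col(Expr):
    name: str

    def evaluate(self, row: Dict[str, Any]) -> Any:
        return row[self.name]


@dataclass(eq=False)
class Lit(Expr):
    value: Any

    def evaluate(self, row: Dict[str, Any]) -> Any:
        return self.value


@dataclass(eq=False)
class BinaryExpr(Expr):
    op: Callable[[Any, Any], Any]
    left: Expr
    right: Expr

    def evaluate(self, row: Dict[str, Any]) -> Any:
        return self.op(self.left.evaluate(row), self.right.evaluate(row))


@dataclass(eq=False)
class UnaryExpr(Expr):
    op: Callable[[Any], Any]
    operand: Expr

    def evaluate(self, row: Dict[str, Any]) -> Any:
        return self.op(self.operand.evaluate(row))


def _wrap(value: Any) -> Expr:
    """Turn plain Python values into literal nodes."""
    return value if isinstance(value, Expr) else Lit(value)


def filter_rows(rows: List[Dict[str, Any]], predicate: Expr) -> List[Dict[str, Any]]:
    """Keep the rows for which the predicate tree evaluates to a truthy value."""
    return [row for row in rows if predicate.evaluate(row)]


rows = [{"id": i, "score": i * 10} for i in range(6)]
predicate = (Col("id") > 1) & ~(Col("score") > 40)
kept = filter_rows(rows, predicate)  # rows with id 2, 3, and 4
```

Because the predicate is data and not an opaque function, a planner can inspect it before execution, which is one motivation for preferring expressions over an arbitrary fn.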
Related issue number
Checks
git commit -s) in this PR.
scripts/format.sh to lint the changes in this PR.
method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.