[Data] Fix renamed columns to be appropriately dropped from the ouput (#58040)#58071
[Data] Fix renamed columns to be appropriately dropped from the ouput (#58040)#58071aslonnie merged 1 commit intoreleases/2.51.0from
Conversation
…#58040) ## Description This change addresses the issues that currently upon column renaming we're not removing original columns. ## Related issues > Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request refactors the projection fusion logic to correctly handle column renames and prevent renamed columns from being incorrectly kept in the output. The changes are centered around making the renaming intent explicit in AliasExpr and simplifying the expression evaluation logic. The core logic seems sound and the refactoring improves clarity. I've added a comment to fix a potentially buggy test assertion.
| assert select_op.exprs == [ | ||
| # TODO fix (renaming doesn't remove prev columns) | ||
| col("sepal.length").alias("length"), | ||
| col("petal.width").alias("width"), | ||
| ] |
There was a problem hiding this comment.
The assertion select_op.exprs == [...] is problematic because Expr.__eq__ is overloaded to create a new expression for filtering, not for equality checking. This means the test might not be correctly verifying the structure of the fused expressions. Additionally, the expected expressions are created with .alias(), but rename_columns uses the internal ._rename() method, which sets the _is_rename=True flag on the AliasExpr. This would cause a structural inequality. The test should be updated to perform a structural comparison, and the TODO comment should be addressed or removed.
| assert select_op.exprs == [ | |
| # TODO fix (renaming doesn't remove prev columns) | |
| col("sepal.length").alias("length"), | |
| col("petal.width").alias("width"), | |
| ] | |
| expected_exprs = [ | |
| col("sepal.length")._rename("length"), | |
| col("petal.width")._rename("width"), | |
| ] | |
| assert len(select_op.exprs) == len(expected_exprs) | |
| assert all(e1.structurally_equals(e2) for e1, e2 in zip(select_op.exprs, expected_exprs)) |
| # This clumsy workaround is necessary to be able to fill in Pyarrow tables | ||
| # that has to be "seeded" from existing table with N rows, and couldn't be | ||
| # started from a truly empty table. | ||
| # | ||
| # TODO fix | ||
| new_block = BlockAccessor.for_block(block).fill_column("__stub__", None) | ||
| new_block = BlockAccessor.for_block(new_block).drop(input_column_names) | ||
|
|
||
| for name, output_col in zip(names, output_cols): | ||
| new_block = BlockAccessor.for_block(new_block).fill_column(name, output_col) | ||
|
|
||
| return BlockAccessor.for_block(new_block).drop(["__stub__"]) |
There was a problem hiding this comment.
The current implementation to construct the new block is acknowledged as a 'clumsy workaround'. This can be simplified and made more efficient by directly constructing the new pyarrow.Table or pandas.DataFrame from the computed columns. For example, for Arrow blocks, pyarrow.Table.from_arrays(list(output_cols), names=list(names)) could be used. This would avoid the overhead of creating and dropping stub columns and make the code cleaner.
…ray-project#58040) (ray-project#58071) ## Description Cherry-pick of ray-project#58040 ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- > Thank you for contributing to Ray! 🚀 > Please review the [Ray Contribution Guide](https://docs.ray.io/en/master/ray-contribute/getting-involved.html) before opening a pull request. >⚠️ Remove these instructions before submitting your PR. > 💡 Tip: Mark as draft if you want early feedback, or ready for review when it's complete. ## Description > Briefly describe what this PR accomplishes and why it's needed. ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Description
Cherry-pick of #58040
Related issues
Additional information
Description
Related issues
Additional information