[Data] Fix renamed columns to be appropriately dropped from the ouput (#58040) by alexeykudinkin · Pull Request #58071 · ray-project/ray

alexeykudinkin · 2025-10-24T01:00:40Z

Description

Cherry-pick of #58040

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to
#1234".

Additional information

Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

Thank you for contributing to Ray! 🚀
Please review the Ray Contribution Guide before opening a pull request.

⚠️ Remove these instructions before submitting your PR.

💡 Tip: Mark as draft if you want early feedback, or ready for review when it's complete.

Description

Briefly describe what this PR accomplishes and why it's needed.

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

…#58040) ## Description This change addresses the issues that currently upon column renaming we're not removing original columns. ## Related issues > Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

cursor · 2025-10-24T01:02:49Z

python/ray/data/expressions.py

            isinstance(other, AliasExpr)
            and self.expr.structurally_equals(other.expr)
            and self.name == other.name
+            and self._is_rename == self._is_rename


Bug: AliasExpr Comparison Bug

In AliasExpr.structurally_equals, the comparison self._is_rename == self._is_rename incorrectly compares the field to itself. This always evaluates to True, causing AliasExpr objects with different _is_rename flags to be considered structurally equal.

gemini-code-assist

Code Review

This pull request refactors the projection fusion logic to correctly handle column renames and prevent renamed columns from being incorrectly kept in the output. The changes are centered around making the renaming intent explicit in AliasExpr and simplifying the expression evaluation logic. The core logic seems sound and the refactoring improves clarity. I've added a comment to fix a potentially buggy test assertion.

gemini-code-assist · 2025-10-24T01:03:52Z

python/ray/data/tests/test_projection_fusion.py

+    assert select_op.exprs == [
+        # TODO fix (renaming doesn't remove prev columns)
+        col("sepal.length").alias("length"),
+        col("petal.width").alias("width"),
+    ]


The assertion select_op.exprs == [...] is problematic because Expr.__eq__ is overloaded to create a new expression for filtering, not for equality checking. This means the test might not be correctly verifying the structure of the fused expressions. Additionally, the expected expressions are created with .alias(), but rename_columns uses the internal ._rename() method, which sets the _is_rename=True flag on the AliasExpr. This would cause a structural inequality. The test should be updated to perform a structural comparison, and the TODO comment should be addressed or removed.

Suggested change

assert select_op.exprs == [

# TODO fix (renaming doesn't remove prev columns)

col("sepal.length").alias("length"),

col("petal.width").alias("width"),

]

expected_exprs = [

col("sepal.length")._rename("length"),

col("petal.width")._rename("width"),

]

assert len(select_op.exprs) == len(expected_exprs)

assert all(e1.structurally_equals(e2) for e1, e2 in zip(select_op.exprs, expected_exprs))

gemini-code-assist · 2025-10-24T01:03:52Z

python/ray/data/_internal/planner/plan_expression/expression_evaluator.py

+    # This clumsy workaround is necessary to be able to fill in Pyarrow tables
+    # that has to be "seeded" from existing table with N rows, and couldn't be
+    # started from a truly empty table.
+    #
+    # TODO fix
+    new_block = BlockAccessor.for_block(block).fill_column("__stub__", None)
+    new_block = BlockAccessor.for_block(new_block).drop(input_column_names)
+
+    for name, output_col in zip(names, output_cols):
+        new_block = BlockAccessor.for_block(new_block).fill_column(name, output_col)
+
+    return BlockAccessor.for_block(new_block).drop(["__stub__"])


The current implementation to construct the new block is acknowledged as a 'clumsy workaround'. This can be simplified and made more efficient by directly constructing the new pyarrow.Table or pandas.DataFrame from the computed columns. For example, for Arrow blocks, pyarrow.Table.from_arrays(list(output_cols), names=list(names)) could be used. This would avoid the overhead of creating and dropping stub columns and make the code cleaner.

…ray-project#58040) (ray-project#58071) ## Description Cherry-pick of ray-project#58040 ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- > Thank you for contributing to Ray! 🚀 > Please review the [Ray Contribution Guide](https://docs.ray.io/en/master/ray-contribute/getting-involved.html) before opening a pull request. > ⚠️ Remove these instructions before submitting your PR. > 💡 Tip: Mark as draft if you want early feedback, or ready for review when it's complete. ## Description > Briefly describe what this PR accomplishes and why it's needed. ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

alexeykudinkin requested a review from a team as a code owner October 24, 2025 01:00

alexeykudinkin added the go add ONLY when ready to merge, run all tests label Oct 24, 2025

cursor bot reviewed Oct 24, 2025

View reviewed changes

gemini-code-assist bot reviewed Oct 24, 2025

View reviewed changes

ray-gardener bot added the data Ray Data-related issues label Oct 24, 2025

aslonnie merged commit 2f4360c into releases/2.51.0 Oct 24, 2025
6 checks passed

aslonnie deleted the ak/prj-pdwn-fix-2-cp branch October 24, 2025 02:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Data] Fix renamed columns to be appropriately dropped from the ouput (#58040)#58071

[Data] Fix renamed columns to be appropriately dropped from the ouput (#58040)#58071
aslonnie merged 1 commit intoreleases/2.51.0from
ak/prj-pdwn-fix-2-cp

alexeykudinkin commented Oct 24, 2025

Uh oh!

cursor bot Oct 24, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Oct 24, 2025

Uh oh!

gemini-code-assist bot Oct 24, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

-    assert select_op.exprs == [
-        # TODO fix (renaming doesn't remove prev columns)
-        col("sepal.length").alias("length"),
-        col("petal.width").alias("width"),
-    ]
+expected_exprs = [
+    col("sepal.length")._rename("length"),
+    col("petal.width")._rename("width"),
+]
+assert len(select_op.exprs) == len(expected_exprs)
+assert all(e1.structurally_equals(e2) for e1, e2 in zip(select_op.exprs, expected_exprs))

Conversation

alexeykudinkin commented Oct 24, 2025

Description

Related issues

Additional information

Description

Related issues

Additional information

Uh oh!

cursor bot Oct 24, 2025

Choose a reason for hiding this comment

Bug: AliasExpr Comparison Bug

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Oct 24, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Oct 24, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants