Skip to content

[Data] Fix renamed columns to be appropriately dropped from the ouput (#58040)#58071

Merged
aslonnie merged 1 commit intoreleases/2.51.0from
ak/prj-pdwn-fix-2-cp
Oct 24, 2025
Merged

[Data] Fix renamed columns to be appropriately dropped from the ouput (#58040)#58071
aslonnie merged 1 commit intoreleases/2.51.0from
ak/prj-pdwn-fix-2-cp

Conversation

@alexeykudinkin
Copy link
Copy Markdown
Contributor

Description

Cherry-pick of #58040

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to
#1234".

Additional information

Optional: Add implementation details, API changes, usage examples,
screenshots, etc.


Thank you for contributing to Ray! 🚀
Please review the Ray Contribution Guide before opening a pull request.

⚠️ Remove these instructions before submitting your PR.

💡 Tip: Mark as draft if you want early feedback, or ready for review when it's complete.

Description

Briefly describe what this PR accomplishes and why it's needed.

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

…#58040)

## Description

This change addresses the issues that currently upon column renaming
we're not removing original columns.

## Related issues
> Link related issues: "Fixes #1234", "Closes #1234", or "Related to
#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
@alexeykudinkin alexeykudinkin requested a review from a team as a code owner October 24, 2025 01:00
@alexeykudinkin alexeykudinkin added the go add ONLY when ready to merge, run all tests label Oct 24, 2025
isinstance(other, AliasExpr)
and self.expr.structurally_equals(other.expr)
and self.name == other.name
and self._is_rename == self._is_rename
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: AliasExpr Comparison Bug

In AliasExpr.structurally_equals, the comparison self._is_rename == self._is_rename incorrectly compares the field to itself. This always evaluates to True, causing AliasExpr objects with different _is_rename flags to be considered structurally equal.

Fix in Cursor Fix in Web

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the projection fusion logic to correctly handle column renames and prevent renamed columns from being incorrectly kept in the output. The changes are centered around making the renaming intent explicit in AliasExpr and simplifying the expression evaluation logic. The core logic seems sound and the refactoring improves clarity. I've added a comment to fix a potentially buggy test assertion.

Comment on lines +1386 to +1390
assert select_op.exprs == [
# TODO fix (renaming doesn't remove prev columns)
col("sepal.length").alias("length"),
col("petal.width").alias("width"),
]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The assertion select_op.exprs == [...] is problematic because Expr.__eq__ is overloaded to create a new expression for filtering, not for equality checking. This means the test might not be correctly verifying the structure of the fused expressions. Additionally, the expected expressions are created with .alias(), but rename_columns uses the internal ._rename() method, which sets the _is_rename=True flag on the AliasExpr. This would cause a structural inequality. The test should be updated to perform a structural comparison, and the TODO comment should be addressed or removed.

Suggested change
assert select_op.exprs == [
# TODO fix (renaming doesn't remove prev columns)
col("sepal.length").alias("length"),
col("petal.width").alias("width"),
]
expected_exprs = [
col("sepal.length")._rename("length"),
col("petal.width")._rename("width"),
]
assert len(select_op.exprs) == len(expected_exprs)
assert all(e1.structurally_equals(e2) for e1, e2 in zip(select_op.exprs, expected_exprs))

Comment on lines +743 to +754
# This clumsy workaround is necessary to be able to fill in Pyarrow tables
# that has to be "seeded" from existing table with N rows, and couldn't be
# started from a truly empty table.
#
# TODO fix
new_block = BlockAccessor.for_block(block).fill_column("__stub__", None)
new_block = BlockAccessor.for_block(new_block).drop(input_column_names)

for name, output_col in zip(names, output_cols):
new_block = BlockAccessor.for_block(new_block).fill_column(name, output_col)

return BlockAccessor.for_block(new_block).drop(["__stub__"])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current implementation to construct the new block is acknowledged as a 'clumsy workaround'. This can be simplified and made more efficient by directly constructing the new pyarrow.Table or pandas.DataFrame from the computed columns. For example, for Arrow blocks, pyarrow.Table.from_arrays(list(output_cols), names=list(names)) could be used. This would avoid the overhead of creating and dropping stub columns and make the code cleaner.

@ray-gardener ray-gardener bot added the data Ray Data-related issues label Oct 24, 2025
@aslonnie aslonnie merged commit 2f4360c into releases/2.51.0 Oct 24, 2025
6 checks passed
@aslonnie aslonnie deleted the ak/prj-pdwn-fix-2-cp branch October 24, 2025 02:11
weiquanlee pushed a commit to antgroup/ant-ray that referenced this pull request Dec 11, 2025
…ray-project#58040) (ray-project#58071)

## Description

Cherry-pick of ray-project#58040

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

> Thank you for contributing to Ray! 🚀
> Please review the [Ray Contribution
Guide](https://docs.ray.io/en/master/ray-contribute/getting-involved.html)
before opening a pull request.

> ⚠️ Remove these instructions before submitting your PR.

> 💡 Tip: Mark as draft if you want early feedback, or ready for review
when it's complete.

## Description
> Briefly describe what this PR accomplishes and why it's needed.

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants