[Data] Enable expressions for grouped with_column in Ray Data by YoussefEssDS · Pull Request #58231 · ray-project/ray

YoussefEssDS · 2025-10-28T01:20:55Z

Description

This PR will:
1- Introduce GroupedData.with_column, allowing grouped datasets to evaluate Ray Data expressions per group while preserving existing columns.
2- Validate the supplied expression type (reject non‑Expr and DownloadExpr since the expression evaluator can’t visit downloads as far as I understand) and reuse the projection engine so grouping flows stay aligned with the dataset-level expression API.
3- Add tests for grouped expression usage through udf-based and arithmetic expressions.
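
For readers skimming the thread, the intended semantics can be sketched in plain Python: rows are partitioned by a key, an expression is evaluated inside each group, and the result is attached as a new column while existing columns are kept. This is only an illustration and not Ray Data's implementation; `grouped_with_column` and the callable `expr` are hypothetical stand-ins for `GroupedData.with_column` and a Ray Data expression.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def grouped_with_column(
    rows: List[Row],
    key: str,
    column_name: str,
    expr: Callable[[Row], Any],
) -> List[Row]:
    """Hypothetical stand-in for GroupedData.with_column (not Ray code)."""
    # Partition the rows by the grouping key.
    groups: Dict[Any, List[Row]] = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    out: List[Row] = []
    for group_rows in groups.values():
        for row in group_rows:
            # Build a new row so the input is not mutated; every existing
            # column is kept and the computed column is added.
            out.append({**row, column_name: expr(row)})
    return out


rows = [
    {"group": 1, "value": 1},
    {"group": 1, "value": 2},
    {"group": 2, "value": 3},
    {"group": 2, "value": 4},
]
result = grouped_with_column(rows, "group", "value_twice", lambda r: r["value"] * 2)
result.sort(key=lambda r: (r["group"], r["value"]))
print(result)
```

The row count is unchanged and the input rows are left untouched, which is the property the PR description calls "preserving existing columns".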

Related issues

Closes #57907

Signed-off-by: YoussefEssDS <oyoussefesseddiq@gmail.com>

gemini-code-assist

Code Review

This pull request introduces GroupedData.with_column, which allows applying expressions to grouped data. The implementation is clean, reusing existing components like map_groups and eval_projection. The added tests cover both UDF-based and arithmetic expressions, ensuring the new feature is well-tested. My feedback includes a minor suggestion to improve code clarity in the tests by removing a redundant dataset creation.

gemini-code-assist · 2025-10-28T01:21:57Z

python/ray/data/tests/test_groupby_e2e.py

+    ds = ray.data.from_items(
+        [
+            {"group": 1, "value": 1},
+            {"group": 1, "value": 2},
+            {"group": 2, "value": 3},
+            {"group": 2, "value": 4},
+        ]
+    )


To improve code clarity and avoid redundancy, you can remove this duplicate dataset creation. The ds variable is already defined with the same data earlier in the test and can be reused for the second assertion.

YoussefEssDS · 2025-10-28T15:51:42Z

Hi @alexeykudinkin do you have any further suggestions? Thanks!

YoussefEssDS · 2025-10-29T18:54:07Z

Hi @goutamvenkat-anyscale can you please give this a review? Thanks

goutamvenkat-anyscale · 2025-10-29T23:08:45Z

python/ray/data/grouped_data.py

+            raise TypeError(
+                "expr must be a Ray Data expression created via the expression API."
+            )
+        if isinstance(expr, DownloadExpr):


Thanks for the contribution!

Curious why enforce this restriction?

I can imagine a case where we have ds.groupby('uri_prefix').with_column('bytes', download('uri')) (say, when we want separate handling per group).

Thanks for the review, @goutamvenkat-anyscale! As far as I understand, in the current path, grouped with_column goes through eval_projection, which delegates to NativeExpressionEvaluator. That evaluator doesn’t support DownloadExpr (from calling download("uri")), so allowing it right now just leads to a late TypeError. I surfaced the restriction here to make things more clear. wdyt?

Ah we need to unify the expression machinery... Can you please paste the full TypeError here? Just curious whether the error is descriptive or not

Sure! The TypeError in the expression evaluator is as follows:
"DownloadExpr evaluation is not yet implemented in NativeExpressionEvaluator."

Do you think the expression evaluator error is descriptive enough (i.e., we can drop the restriction in grouped_data)? Or we keep it for further clarity? wdyt @goutamvenkat-anyscale ?

Let's leave out this check for now. We're in the process of adding Actor support for UDFExpr, and DownloadExpr will be refactored to go through the eval_projection flow.

Done! Can you please check if it's good to go? Thanks.
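
The trade-off discussed in this thread (an upfront check in the grouped API versus letting the evaluator raise later) can be sketched as follows. Everything here is a hypothetical miniature, not Ray Data code: `Expr`, `ColumnExpr`, `DownloadExpr`, `evaluate`, `with_column_lazy`, and `with_column_checked` are illustrative names, the URI is made up, and only the two error strings quoted in the thread are taken from the PR.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Iterator

Row = Dict[str, Any]


class Expr:
    """Hypothetical base class standing in for a Ray Data expression."""


@dataclass
class ColumnExpr(Expr):
    name: str


@dataclass
class DownloadExpr(Expr):
    uri_column: str


def evaluate(expr: Expr, row: Row) -> Any:
    """Toy evaluator that, like the one described above, cannot handle downloads."""
    if isinstance(expr, ColumnExpr):
        return row[expr.name]
    if isinstance(expr, DownloadExpr):
        raise TypeError(
            "DownloadExpr evaluation is not yet implemented in NativeExpressionEvaluator."
        )
    raise TypeError(f"Unsupported expression: {expr!r}")


def with_column_lazy(rows: Iterable[Row], name: str, expr: Expr) -> Iterator[Row]:
    """No upfront check: an unsupported expression fails only when rows are consumed."""
    for row in rows:
        yield {**row, name: evaluate(expr, row)}


def with_column_checked(rows: Iterable[Row], name: str, expr: Any) -> Iterator[Row]:
    """Upfront check: fails at call time, before any row is processed."""
    if not isinstance(expr, Expr):
        raise TypeError(
            "expr must be a Ray Data expression created via the expression API."
        )
    if isinstance(expr, DownloadExpr):
        # Hypothetical message; the PR ultimately dropped this check.
        raise TypeError("DownloadExpr is not supported in this sketch.")
    return with_column_lazy(rows, name, expr)


rows = [{"uri": "s3://bucket/a"}]

pipeline = with_column_lazy(rows, "bytes", DownloadExpr("uri"))  # no error yet
try:
    list(pipeline)  # the error surfaces only here
except TypeError as err:
    late_error = str(err)

try:
    with_column_checked(rows, "bytes", DownloadExpr("uri"))  # fails immediately
    early_error = None
except TypeError as err:
    early_error = str(err)
```

Dropping the upfront check, as agreed above, means relying on the evaluator's own message, which already names the unsupported expression type.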

YoussefEssDS · 2025-10-31T19:36:53Z

@goutamvenkat-anyscale PTAL. Thanks!

goutamvenkat-anyscale

LGTM. Thanks for your contribution

YoussefEssDS · 2025-11-08T06:48:19Z

Hi @bveeramani, just bumping this. It's approved with the 'go' tag and ready to merge. Thanks!

bveeramani · 2025-11-11T00:13:14Z

python/ray/data/grouped_data.py

+            from ray.data._internal.planner.plan_expression.expression_evaluator import (
+                eval_projection,
+            )
+


Nit: Is this import prone to circular import errors? If not, move to top of file for consistency with PEP8/Google convention?

It causes the doc build to fail, since the evaluator requires pyarrow. wdyt? Keep the import inside for a green doc build or move it to top?
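
A minimal sketch of why a function-level import helps here: the import statement only runs when the function is called, so merely importing the enclosing module (as a documentation build does) never touches the dependency. The module name `_hypothetical_heavy_dependency` and the function below are made up for illustration and deliberately refer to a package that is not installed.

```python
def evaluate_with_heavy_dependency(values):
    """Hypothetical function whose dependency is resolved only at call time."""
    # Deferred import: this line runs when the function is called, not when
    # the enclosing module is imported.
    import _hypothetical_heavy_dependency as heavy  # assumed to be missing

    return heavy.process(values)


# Defining (or importing) the function succeeds even without the dependency.
definition_ok = callable(evaluate_with_heavy_dependency)

# The missing dependency only matters once the function is actually called.
try:
    evaluate_with_heavy_dependency([1, 2, 3])
    import_error = None
except ImportError as err:
    import_error = err
```

The cost is the one raised in the review: the import no longer sits at the top of the file, which departs from the PEP 8 convention.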

cursor · 2025-11-11T22:15:39Z

python/ray/data/grouped_data.py

        fn: UserDefinedFunction[DataBatch, DataBatch],
        *,
-        zero_copy_batch: bool = False,
+        zero_copy_batch: bool = True,


Bug: Breaking Change: Batches Become Read-Only

Changing the default value of zero_copy_batch from False to True in map_groups breaks existing code that mutates input batches. For example, the train_test_split method's add_train_flag function mutates the batch by adding a column, which fails with read-only errors when zero_copy_batch=True. This is a breaking API change that affects all existing map_groups calls that modify their input.
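
The failure mode described here can be illustrated with a plain-Python analogy. This sketch does not use Ray; `types.MappingProxyType` merely plays the role of a read-only batch, and both functions are hypothetical re-creations of a UDF that adds a column.

```python
from types import MappingProxyType


def add_train_flag(batch):
    """Hypothetical UDF that mutates its input batch in place."""
    batch["is_train"] = [True] * len(batch["value"])
    return batch


def add_train_flag_copying(batch):
    """Variant that copies first, so it also works on read-only input."""
    out = dict(batch)
    out["is_train"] = [True] * len(out["value"])
    return out


mutable_batch = {"value": [1, 2, 3]}
read_only_batch = MappingProxyType({"value": [1, 2, 3]})

add_train_flag(mutable_batch)  # fine: the batch is writable

try:
    add_train_flag(read_only_batch)  # in-place mutation is rejected
    read_only_error = None
except TypeError as err:
    read_only_error = err

copied = add_train_flag_copying(read_only_batch)  # fine: works on a copy
```

Code written in the first style keeps working only as long as the batch it receives is writable, which is why flipping the default is a breaking change for existing callers.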

bveeramani

LGTM with some minor comments.

For the doc failure -- if moving the import inside the function resolves the failure, I think that's okay to unblock this PR

bveeramani · 2025-11-16T02:07:16Z

python/ray/data/grouped_data.py

            memory=memory,
            concurrency=concurrency,
-            udf_modifying_row_count=True,
+            udf_modifying_row_count=False,


Map groups can change the row count, right?
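
To illustrate the point behind this question, here is a plain-Python sketch (not Ray code) of a map_groups-style helper. `map_groups_sketch`, `summarize`, and `add_column` are hypothetical names; the first UDF collapses each group to a single row, while the second keeps one output row per input row.

```python
from itertools import groupby
from operator import itemgetter


def map_groups_sketch(rows, key, fn):
    """Apply fn to each group of rows and concatenate whatever it returns."""
    ordered = sorted(rows, key=itemgetter(key))
    out = []
    for _, group in groupby(ordered, key=itemgetter(key)):
        out.extend(fn(list(group)))
    return out


def summarize(group):
    """Collapses a group to one row, so the total row count shrinks."""
    return [{"group": group[0]["group"], "total": sum(r["value"] for r in group)}]


def add_column(group):
    """Returns one output row per input row, so the row count is preserved."""
    return [{**r, "value_twice": r["value"] * 2} for r in group]


rows = [
    {"group": 1, "value": 1},
    {"group": 1, "value": 2},
    {"group": 2, "value": 3},
    {"group": 2, "value": 4},
]
summarized = map_groups_sketch(rows, "group", summarize)
extended = map_groups_sketch(rows, "group", add_column)
print(len(rows), len(summarized), len(extended))
```

An arbitrary group UDF can therefore change the row count, even though a column-adding one does not.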

bveeramani · 2025-11-16T02:08:19Z

python/ray/data/grouped_data.py

+            >>> ds.groupby("group").with_column("value_twice", col("value") * 2).sort(["group", "value"]).take_all() # doctest: +SKIP
+            [{'group': 1, 'value': 1, 'value_twice': 2}, {'group': 1, 'value': 2, 'value_twice': 4}]


What happens if you test this (i.e., remove # doctest: +SKIP)?

Ideally, I think we should avoid skipping these, so that breakage is caught.
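
The concern about skipped doctests can be shown with the standard-library `doctest` module alone. In this sketch, `double_values` and `run_examples` are hypothetical helpers; the second example in the docstring has a deliberately wrong expected output, yet the run reports no failures because `+SKIP` means that example is never executed.

```python
import doctest


def double_values(rows):
    """Return rows with a ``value_twice`` field added.

    >>> double_values([{"value": 1}])
    [{'value': 1, 'value_twice': 2}]
    >>> double_values([{"value": 2}])  # doctest: +SKIP
    'this wrong expected output is never checked'
    """
    return [{**row, "value_twice": row["value"] * 2} for row in rows]


def run_examples(func):
    """Run the doctest examples in func's docstring and return the summary."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    for test in finder.find(func, name=func.__name__, globs={func.__name__: func}):
        runner.run(test)
    return runner.summarize(verbose=False)


results = run_examples(double_values)
print(results.failed)
```

A skipped example gives no signal when the documented behavior drifts, which is the risk the comment above points at.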

bveeramani · 2025-11-16T02:10:41Z

python/ray/data/grouped_data.py

+            >>> ds = ray.data.from_items([{"group": 1, "value": 1}, {"group": 1, "value": 2}])
+            >>> ds.groupby("group").with_column("value_twice", col("value") * 2).sort(["group", "value"]).take_all() # doctest: +SKIP
+            [{'group': 1, 'value': 1, 'value_twice': 2}, {'group': 1, 'value': 2, 'value_twice': 4}]


Could you format this snippet? I think it might be hard to read in the rendered documentation

Suggested change

-            >>> ds = ray.data.from_items([{"group": 1, "value": 1}, {"group": 1, "value": 2}])
-            >>> ds.groupby("group").with_column("value_twice", col("value") * 2).sort(["group", "value"]).take_all() # doctest: +SKIP
-            [{'group': 1, 'value': 1, 'value_twice': 2}, {'group': 1, 'value': 2, 'value_twice': 4}]
+            >>> ds = (
+            ...     ray.data.from_items([{"group": 1, "value": 1}, {"group": 1, "value": 2}])
+            ...     .groupby("group")
+            ...     .with_column("value_twice", col("value") * 2)
+            ...     .sort(["group", "value"])
+            ... )
+            >>> ds.take_all() # doctest: +SKIP
+            [{'group': 1, 'value': 1, 'value_twice': 2}, {'group': 1, 'value': 2, 'value_twice': 4}]

YoussefEssDS · 2025-11-17T19:36:33Z

Hi @bveeramani any idea how to unblock this CI error:
Error response from daemon: client version 1.52 is too new. Maximum supported API version is 1.43

bveeramani · 2025-11-18T01:54:09Z

Hi @bveeramani any idea how to unblock this CI error: Error response from daemon: client version 1.52 is too new. Maximum supported API version is 1.43

Huh, that's weird. Just updated the branch. Let's see if that fixes it

YoussefEssDS · 2025-11-18T06:59:31Z

Seems that did the trick. Thanks @bveeramani

…oject#58231) ### Description This PR will: 1- Introduce `GroupedData.with_column`, allowing grouped datasets to evaluate Ray Data expressions per group while preserving existing columns. 2- Validate the supplied expression type (reject non‑Expr and DownloadExpr since the expression evaluator can’t visit downloads as far as I understand) and reuse the projection engine so grouping flows stay aligned with the dataset-level expression API. 3- Add tests for grouped expression usage through udf-based and arithmetic expressions. ### Related issues Closes ray-project#57907 --------- Signed-off-by: YoussefEssDS <oyoussefesseddiq@gmail.com> Co-authored-by: Balaji Veeramani <balaji@anyscale.com> Signed-off-by: Aydin Abiar <aydin@anyscale.com>

Add expression-based GroupedData.with_column along with the tests

4fc2dd0

Signed-off-by: YoussefEssDS <oyoussefesseddiq@gmail.com>

YoussefEssDS requested a review from a team as a code owner October 28, 2025 01:20

gemini-code-assist bot reviewed Oct 28, 2025

ray-gardener bot added the data (Ray Data-related issues) and community-contribution (Contributed by the community) labels Oct 28, 2025

Remove redundant definition and fix import order

c5514f7

Signed-off-by: YoussefEssDS <oyoussefesseddiq@gmail.com>

goutamvenkat-anyscale reviewed Oct 29, 2025

YoussefEssDS force-pushed the grouped-with-column branch 2 times, most recently from 847771d to 6625d9f Compare October 31, 2025 00:28

Remove the DownloadExpr restriction

ea4a08f

Signed-off-by: YoussefEssDS <oyoussefesseddiq@gmail.com>

YoussefEssDS force-pushed the grouped-with-column branch from 6625d9f to ea4a08f Compare October 31, 2025 00:32

Fix import order

8496aa9

Signed-off-by: YoussefEssDS <oyoussefesseddiq@gmail.com>

YoussefEssDS requested a review from goutamvenkat-anyscale November 2, 2025 18:15

goutamvenkat-anyscale approved these changes Nov 3, 2025

goutamvenkat-anyscale added the go (add ONLY when ready to merge, run all tests) label Nov 3, 2025

bveeramani requested changes Nov 11, 2025

Cleanup and add docstring example

f021ffd

Signed-off-by: YoussefEssDS <oyoussefesseddiq@gmail.com>

cursor bot reviewed Nov 11, 2025

YoussefEssDS added 3 commits November 11, 2025 17:16

Fix lint

3babfa8

Signed-off-by: YoussefEssDS <oyoussefesseddiq@gmail.com>

Fix docstring

713a6de

Signed-off-by: YoussefEssDS <oyoussefesseddiq@gmail.com>

Fix docstring

3c47d62

Signed-off-by: YoussefEssDS <oyoussefesseddiq@gmail.com>

YoussefEssDS requested a review from bveeramani November 12, 2025 14:56

bveeramani self-assigned this Nov 13, 2025

richardliaw mentioned this pull request Nov 15, 2025

Ray Data Q4 Roadmap + Wishlist #58665

Open

bveeramani approved these changes Nov 16, 2025

Fix docstring and import

2ab94f2

Signed-off-by: YoussefEssDS <oyoussefesseddiq@gmail.com>

cursor bot reviewed Nov 16, 2025

cursor bot reviewed Nov 16, 2025

YoussefEssDS force-pushed the grouped-with-column branch from bd7225e to 2ab94f2 Compare November 16, 2025 18:46

YoussefEssDS added 2 commits November 16, 2025 21:49

Re-trigger CI

e40112e

Signed-off-by: YoussefEssDS <oyoussefesseddiq@gmail.com>

Re-trigger CI

85496cd

Signed-off-by: YoussefEssDS <oyoussefesseddiq@gmail.com>

Merge branch 'master' into grouped-with-column

84db703

bveeramani merged commit 303d366 into ray-project:master Nov 18, 2025
6 checks passed
