[data] New executor backend [3/n]--- Add basic operators impl by ericl · Pull Request #31305 · ray-project/ray

ericl · 2022-12-23T00:13:27Z

Why are these changes needed?

Add the initial operator implementations.

This is split out from #30903

TODO:

Add unit test for AllToAllOp
Add unit test for MapOp
Add unit test for InputDataBuffer

Signed-off-by: Eric Liang <ekhliang@gmail.com>

ericl · 2022-12-23T01:23:24Z

python/ray/data/tests/test_operators.py

+    assert _take_outputs(op) == [[i] for i in range(10)]
+
+
+def test_map_operator_ray_args(shutdown_only):


Debating whether it's worth it to mock out the Ray API here to speed up these tests a bit. Maybe it's not that important since the bulk of the testing will be for StreamingExecutor, which we can write separate mocks for.

python/ray/data/_internal/execution/operators/input_data_buffer.py

stephanie-wang · 2023-01-03T19:35:15Z

python/ray/data/_internal/execution/operators/map_operator.py

+
+        Supported strategies: {TaskPoolStrategy, ActorPoolStrategy}.
+        """
+        return self._strategy


Hmm, I wonder if we can keep the implementation details of the compute strategy, Ray remote args, etc. outside of the operators? It could be cleaner to pass in the ray.remote Callable instead of the worker's Callable as the transform_fn, but I'm not sure this will work, so I'll leave it up to you.

Yeah, I have a TODO on line 78 to clean this up in the future. I'm hoping the ComputeStrategy can turn into a simple dataclass once we migrate fully to the new backend. Right now, I avoided doing this refactoring to keep the changes self-contained.

About the callable, I think that's possible but it's probably also easier to do once we have the logical optimization layer in place (the optimizer could generate the ray.remote callable).
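To make the suggestion above concrete, here is a minimal sketch of an operator that receives an already-wrapped "submit" callable rather than the raw transform function. It is not the code in this PR: `SketchMapOperator`, `make_submit_fn`, and `take_outputs` are hypothetical names, and a `ThreadPoolExecutor` stands in for `ray.remote` so the example runs without a Ray cluster.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, List

Block = List[int]
# Stand-in for a ray.remote-wrapped function: takes a block, returns a future.
SubmitFn = Callable[[Block], Future]


class SketchMapOperator:
    """Hypothetical operator that knows nothing about how work is scheduled."""

    def __init__(self, submit_fn: SubmitFn):
        self._submit_fn = submit_fn
        self._pending: List[Future] = []

    def add_input(self, block: Block) -> None:
        # The operator only submits work; pool size, remote args, and the
        # compute strategy all live inside submit_fn.
        self._pending.append(self._submit_fn(block))

    def take_outputs(self) -> List[Block]:
        # Wait for all submitted work and return results in submission order.
        outputs = [f.result() for f in self._pending]
        self._pending.clear()
        return outputs


def make_submit_fn(
    transform_fn: Callable[[Block], Block], pool: ThreadPoolExecutor
) -> SubmitFn:
    """Built outside the operator, e.g. by a planner or optimizer layer."""

    def submit(block: Block) -> Future:
        return pool.submit(transform_fn, block)

    return submit


pool = ThreadPoolExecutor(max_workers=2)
op = SketchMapOperator(make_submit_fn(lambda b: [x * 2 for x in b], pool))
for i in range(3):
    op.add_input([i])
result = op.take_outputs()
pool.shutdown()
print(result)
```

The design point is only that scheduling details are injected from outside; whether this fits the real operator interface is an open question in the thread.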

stephanie-wang

Looks great!

ericl · 2023-01-03T21:03:50Z

I'll hold this open until EOD for more comments.

python/ray/data/_internal/execution/operators/map_operator_tasks_impl.py

python/ray/data/_internal/execution/util.py

python/ray/data/_internal/execution/operators/map_operator_tasks_impl.py

c21 · 2023-01-03T22:44:00Z

python/ray/data/_internal/execution/operators/map_operator.py

+            input_op: Operator generating input data for this op.
+            name: The name of this operator.
+            compute_strategy: Customize the compute strategy for this op.
+            min_rows_per_batch: The number of rows to gather per batch passed to the


Should we name it min_rows_per_fn_call? "Batch" is kind of confusing here, as this is neither the user-facing batch in map_batches nor the zero-copy batch execution we will introduce later.

Batch seems clearer to me: it is basically the same as the user-facing batch size.

I wonder if we should keep using "target_row_per_batch", since there is no guarantee of a "min" here. We should also clarify that the target may not be met when there are not enough rows.

I don't think so, the previous naming was very confusing for me. The new one is clear in intent.

@ericl But that intent is incorrect: this is a target to get near to, not a minimum/floor. We add blocks to a bundle up to this target size, but we purposefully do not exceed it, so this is definitely not a minimum.

Alright, let me rename this to min_rows_per_bundle then. I don't think it's possible to be precisely unambiguous, and would prefer we keep the "min" intent which is the big picture.
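The naming disagreement is easier to follow with a small model of the bundling behavior described in this thread. The sketch below is an assumption-laden illustration, not the implementation in map_operator_tasks_impl.py: `BlockBundler` and its methods are hypothetical, and blocks are represented only by their row counts.

```python
from typing import List, Optional


class BlockBundler:
    """Hypothetical sketch: groups blocks (given as row counts) into bundles."""

    def __init__(self, min_rows_per_bundle: Optional[int]):
        self._min_rows = min_rows_per_bundle
        self._buffer: List[int] = []
        self._buffered_rows = 0
        self.ready: List[List[int]] = []

    def add(self, num_rows: int) -> None:
        if self._min_rows is None:
            # No bundling requested: every block passes through as its own
            # bundle, so empty blocks are propagated rather than dropped.
            self.ready.append([num_rows])
            return
        # Flush first if this block would push the bundle past the target.
        if self._buffer and self._buffered_rows + num_rows > self._min_rows:
            self._flush()
        self._buffer.append(num_rows)
        self._buffered_rows += num_rows
        # Emit as soon as the target is reached. A single oversized block
        # still forms a bundle on its own.
        if self._buffered_rows >= self._min_rows:
            self._flush()

    def finish(self) -> None:
        # Emit whatever is left, even if it is below the target.
        if self._buffer:
            self._flush()

    def _flush(self) -> None:
        self.ready.append(self._buffer)
        self._buffer = []
        self._buffered_rows = 0


bundler = BlockBundler(min_rows_per_bundle=5)
for n in [3, 3, 3]:
    bundler.add(n)
bundler.finish()
print(bundler.ready)  # every bundle has 3 rows, below the "min" of 5

passthrough = BlockBundler(min_rows_per_bundle=None)
for n in [0, 4]:
    passthrough.add(n)
print(passthrough.ready)  # the empty block is kept
```

Under these assumed rules, three 3-row blocks with a value of 5 yield three 3-row bundles, which is the case raised above: the value acts as a target that bundles approach without exceeding, not as a guaranteed floor.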

python/ray/data/_internal/execution/operators/input_data_buffer.py

python/ray/data/_internal/execution/operators/map_operator.py

jianoaix · 2023-01-03T23:15:44Z

python/ray/data/_internal/execution/operators/map_operator.py

+            input_op: Operator generating input data for this op.
+            name: The name of this operator.
+            compute_strategy: Customize the compute strategy for this op.
+            min_rows_per_batch: The number of rows to gather per batch passed to the


I wonder if we should keep using "target_row_per_batch", since there is no guarantee of a "min" here. We should also clarify that the target may not be met when there are not enough rows.

clarkzinzow

Mostly nits, the only potential blocker in my mind is the question around the block bundling logic: it appears to be dropping empty blocks, which I don't think is the current Datasets behavior.

python/ray/data/_internal/execution/interfaces.py

python/ray/data/_internal/execution/operators/input_data_buffer.py

python/ray/data/_internal/execution/operators/map_operator_tasks_impl.py

python/ray/data/tests/test_operators.py

python/ray/data/_internal/execution/operators/map_operator.py

ericl · 2023-01-04T02:06:34Z

python/ray/data/_internal/execution/operators/map_operator_tasks_impl.py

+        self._obj_store_mem_peak: int = 0
+
+    def add_input(self, bundle: RefBundle) -> None:
+        if self._min_rows_per_bundle is None:


I ended up putting this back, in order to enable empty block propagation.

ericl

Updated; the main change is that I removed the circular dependency between the operator impl and the wrapper operator.

…oject#31305) Add the initial operator implementations. This is split out from ray-project#30903 Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>

ericl added 3 commits December 22, 2022 16:03

add operators

f83edd9

add test execution

bc8f342

wip

50b456a

ericl requested review from c21, clarkzinzow, jianoaix, jjyao and scv119 as code owners December 23, 2022 00:13

ericl added 5 commits December 22, 2022 16:15

wip

bdfef58

Signed-off-by: Eric Liang <ekhliang@gmail.com>

add test todos

d810c61

Signed-off-by: Eric Liang <ekhliang@gmail.com>

add data stats todo

91b2848

Signed-off-by: Eric Liang <ekhliang@gmail.com>

Merge remote-tracking branch 'upstream/master' into operators

9e706ad

add basic tests

d4f514a

ericl changed the title ~~[WIP] [data] New executor backend [3/n]--- Add basic operators impl~~ [data] New executor backend [3/n]--- Add basic operators impl Dec 23, 2022

ericl assigned stephanie-wang, c21, clarkzinzow and jianoaix Dec 23, 2022

ericl commented Dec 23, 2022

stephanie-wang reviewed Jan 3, 2023

typo

b95a356

Signed-off-by: Eric Liang <ekhliang@gmail.com>

stephanie-wang approved these changes Jan 3, 2023

Merge remote-tracking branch 'upstream/master' into operators

ab4e5d7

c21 reviewed Jan 3, 2023

comments

a6e8a18

Signed-off-by: Eric Liang <ekhliang@gmail.com>

jianoaix reviewed Jan 3, 2023

comments 2

bc021c9

Signed-off-by: Eric Liang <ekhliang@gmail.com>

jianoaix approved these changes Jan 3, 2023

clarkzinzow reviewed Jan 4, 2023

ericl and others added 8 commits January 3, 2023 17:25

cleanup hierarchy

718a32e

or zero

f3d8a50

Signed-off-by: Eric Liang <ekhliang@gmail.com>

Apply suggestions from code review

3228401

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> Signed-off-by: Eric Liang <ekhliang@gmail.com>

Merge branch 'operators' of github.com:ericl/ray into operators

d1a98d6

min rows per bundle

1a8dc02

Signed-off-by: Eric Liang <ekhliang@gmail.com>

fix tests

203720e

last comment

e1d2e89

Signed-off-by: Eric Liang <ekhliang@gmail.com>

add min rows

bf4ef1d

Signed-off-by: Eric Liang <ekhliang@gmail.com>

ericl commented Jan 4, 2023

fix tests

f7cd953

Signed-off-by: Eric Liang <ekhliang@gmail.com>

ericl merged commit 4195de1 into ray-project:master Jan 4, 2023

AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023

[data] New executor backend [3/n]--- Add basic operators impl (#31305)

ecffa65

Add the initial operator implementations. This is split out from #30903
