[Data] Fix limit operator by raulchen · Pull Request #34800 · ray-project/ray

raulchen · 2023-04-26T21:14:51Z

Why are these changes needed?

Fixes a bug introduced by #34705

Related issue number

#34234

Checks

I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Hao Chen <chenh1024@gmail.com>

ericl · 2023-04-26T21:58:29Z

It seems like the limit tests are still broken:


FAILED python/ray/data/tests/test_consumption.py::test_limit_no_redundant_read[10-1]
  | FAILED python/ray/data/tests/test_consumption.py::test_limit_no_redundant_read[20-2]
  | FAILED python/ray/data/tests/test_consumption.py::test_limit_no_redundant_read[30-3]
  | FAILED python/ray/data/tests/test_consumption.py::test_limit_no_redundant_read[60-6]
  | FAILED python/ray/data/tests/test_consumption.py::test_limit_no_num_row_info

Signed-off-by: Hao Chen <chenh1024@gmail.com>

raulchen · 2023-04-26T23:16:04Z

python/ray/data/tests/test_consumption.py

                    BlockMetadata(
                        num_rows=None,
-                        size_bytes=None,
+                        size_bytes=0,


If I don't change this. I get the following error.
I believe it is a bug of the test itself.
The previous implemented streaming execution for "limit" and just triggered this bug.
size_bytes isn't supposed to be None in the first place.

@ericl @c21 could you help confirm?

python/ray/data/datastream.py:2276: in take for row in self.iter_rows(): python/ray/data/iterator.py:234: in iter_rows for batch in self.iter_batches(**iter_batch_args): python/ray/data/iterator.py:167: in iter_batches block_iterator, stats, blocks_owned_by_consumer = self._to_block_iterator() python/ray/data/_internal/iterator/iterator_impl.py:33: in _to_block_iterator block_iterator, stats, executor = ds._plan.execute_to_iterator() python/ray/data/_internal/plan.py:512: in execute_to_iterator block_iter = itertools.chain([next(gen)], gen) python/ray/data/_internal/execution/legacy_compat.py:51: in execute_to_legacy_block_iterator bundle_iter = execute_to_legacy_bundle_iterator( python/ray/data/_internal/execution/legacy_compat.py:80: in execute_to_legacy_bundle_iterator dag, stats = _get_execution_dag( python/ray/data/_internal/execution/legacy_compat.py:141: in _get_execution_dag dag, stats = _to_operator_dag(plan, allow_clear_input_blocks) python/ray/data/_internal/execution/legacy_compat.py:174: in _to_operator_dag operator = _blocks_to_input_buffer(blocks, owns_blocks) python/ray/data/_internal/execution/legacy_compat.py:235: in _blocks_to_input_buffer output = _block_list_to_bundles(blocks, owns_blocks=owns_blocks) python/ray/data/_internal/execution/legacy_compat.py:359: in _block_list_to_bundles RefBundle( <string>:7: in __init__ ??? _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ self = RefBundle(blocks=[(ObjectRef(00ffffffffffffffffffffffffffffffffffffff0100000001000000), BlockMetadata(num_rows=None, s...None, schema=None, input_files=[], exec_stats=None))], owns_blocks=False, output_split_idx=None, _cached_location=None) def __post_init__(self): for b in self.blocks: assert isinstance(b, tuple), b assert len(b) == 2, b assert isinstance(b[0], ray.ObjectRef), b assert isinstance(b[1], BlockMetadata), b if b[1].size_bytes is None: > raise ValueError( "The size in bytes of the block must be known: {}".format(b) ) E ValueError: The size in bytes of the block must be known: (ObjectRef(00ffffffffffffffffffffffffffffffffffffff0100000001000000), BlockMetadata(num_rows=None, size_bytes=None, schema=None, input_files=[], exec_stats=None))

Do we know why the limit operator would trigger the bug? Is it because the previous is not using streaming executor, and right now we are using it?

I think so - https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/plan.py#L764 . So the theory makes sense.

Can we put a meaningful number here? such as 4 * n?

c21

LGTM

raulchen · 2023-04-26T23:38:03Z

test_limit_no_redundant_read is still failing. This test checks that ray.data.read_datasource(...).limit(1) will limit the number of read tasks being run. The reason why it could pass before is also because this test goes to the old execution backend. For the streaming backend, this functionality is not implemented yet.

I think I can skip this test for now and re-enable it after implementing the optimization for streaming executor?

c21 · 2023-04-26T23:41:52Z

This looks a regresson now with new limit operator impl.
Ok from me to skip, as long as we work on implementing the optimization for streaming executor as P0 ASAP. Thanks.

Signed-off-by: Hao Chen <chenh1024@gmail.com>

raulchen · 2023-04-26T23:56:00Z

This looks a regresson now with new limit operator impl.
Ok from me to skip, as long as we work on implementing the optimization for streaming executor as P0 ASAP. Thanks.

Sure. Will prioritize it.

ericl · 2023-04-27T06:06:39Z

Just one lint error.

Signed-off-by: Hao Chen <chenh1024@gmail.com>

raulchen · 2023-04-27T20:41:45Z

CI failures should be unrelated. Merging.

## Why are these changes needed? Fixes a bug introduced by ray-project#34705 ## Related issue number ray-project#34234 Signed-off-by: Jack He <jackhe2345@gmail.com>

## Why are these changes needed? Fixes a bug introduced by ray-project#34705 ## Related issue number ray-project#34234

Fix limit

8735a58

Signed-off-by: Hao Chen <chenh1024@gmail.com>

raulchen assigned ericl Apr 26, 2023

raulchen requested review from c21, ericl and scv119 as code owners April 26, 2023 21:14

raulchen assigned c21 Apr 26, 2023

raulchen requested review from amogkam, bveeramani and scottjlee as code owners April 26, 2023 21:14

raulchen changed the title ~~[Data] Fix limit operator~~ [WIP][Data] Fix limit operator Apr 26, 2023

ericl approved these changes Apr 26, 2023

View reviewed changes

ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Apr 26, 2023

fix size_bytes

24ca206

Signed-off-by: Hao Chen <chenh1024@gmail.com>

raulchen commented Apr 26, 2023

View reviewed changes

c21 approved these changes Apr 26, 2023

View reviewed changes

raulchen added 3 commits April 26, 2023 16:49

size_bytes

b1c2af3

Signed-off-by: Hao Chen <chenh1024@gmail.com>

skip

6fb33b9

Signed-off-by: Hao Chen <chenh1024@gmail.com>

fix decorator

0c13446

Signed-off-by: Hao Chen <chenh1024@gmail.com>

lint

e29062f

Signed-off-by: Hao Chen <chenh1024@gmail.com>

raulchen changed the title ~~[WIP][Data] Fix limit operator~~ [Data] Fix limit operator Apr 27, 2023

raulchen merged commit f419788 into ray-project:master Apr 27, 2023

raulchen deleted the fix-limit branch April 27, 2023 20:55

architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023

[Data] Fix limit operator (ray-project#34800)

cb63c71

## Why are these changes needed? Fixes a bug introduced by ray-project#34705 ## Related issue number ray-project#34234

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Data] Fix limit operator#34800

[Data] Fix limit operator#34800
raulchen merged 6 commits intoray-project:masterfrom
raulchen:fix-limit

raulchen commented Apr 26, 2023 •

edited

Loading

Uh oh!

ericl commented Apr 26, 2023

Uh oh!

raulchen Apr 26, 2023

Uh oh!

c21 Apr 26, 2023

Uh oh!

c21 Apr 26, 2023

Uh oh!

c21 left a comment

Uh oh!

raulchen commented Apr 26, 2023

Uh oh!

c21 commented Apr 26, 2023

Uh oh!

raulchen commented Apr 26, 2023

Uh oh!

ericl commented Apr 27, 2023

Uh oh!

raulchen commented Apr 27, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

raulchen commented Apr 26, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why are these changes needed?

Related issue number

Checks

Uh oh!

ericl commented Apr 26, 2023

Uh oh!

raulchen Apr 26, 2023

Choose a reason for hiding this comment

Uh oh!

c21 Apr 26, 2023

Choose a reason for hiding this comment

Uh oh!

c21 Apr 26, 2023

Choose a reason for hiding this comment

Uh oh!

c21 left a comment

Choose a reason for hiding this comment

Uh oh!

raulchen commented Apr 26, 2023

Uh oh!

c21 commented Apr 26, 2023

Uh oh!

raulchen commented Apr 26, 2023

Uh oh!

ericl commented Apr 27, 2023

Uh oh!

raulchen commented Apr 27, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

raulchen commented Apr 26, 2023 •

edited

Loading