[Data] Make Zip op streaming #58721

Open
owenowenisme wants to merge 17 commits into ray-project:master from owenowenisme:data/make-zip-op-streaming

Conversation

@owenowenisme
Member

@owenowenisme owenowenisme commented Nov 18, 2025

Description

Make the Zip operator streaming, based on StreamingRepartition.

Related issues

Closes #56300

@owenowenisme owenowenisme requested a review from a team as a code owner November 18, 2025 00:24
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the Zip operator to be a streaming operator instead of a bulk one. This is a significant and positive change for performance and memory usage. The approach taken is to introduce a new logical optimizer rule, AddStreamingRepartitionWhenZip, which ensures that all inputs to a Zip operation are repartitioned to have the same number of rows per block. This simplifies the ZipOperator's logic, as it no longer needs to handle complex block alignment.

The implementation is solid and includes new tests for the optimizer rule. I have a couple of suggestions for improvement: one for performance in the remote zip task, and another for making the repartitioning target configurable.

Comment on lines +176 to +181
merged_blocks = []
for blocks in block_groups:
    builder = DelegatingBlockBuilder()
    for block in blocks:
        builder.add_block(ray.get(block))
    merged_blocks.append(builder.build())

high

The current implementation fetches blocks sequentially within the remote task, which can be inefficient. You can improve performance by fetching all blocks in parallel using a single ray.get() call before merging them.

Suggested change
-merged_blocks = []
-for blocks in block_groups:
-    builder = DelegatingBlockBuilder()
-    for block in blocks:
-        builder.add_block(ray.get(block))
-    merged_blocks.append(builder.build())
+all_block_refs = [block for blocks in block_groups for block in blocks]
+all_blocks_resolved = ray.get(all_block_refs)
+block_map = dict(zip(all_block_refs, all_blocks_resolved))
+merged_blocks = []
+for blocks in block_groups:
+    builder = DelegatingBlockBuilder()
+    for block_ref in blocks:
+        builder.add_block(block_map[block_ref])
+    merged_blocks.append(builder.build())

class AddStreamingRepartitionWhenZip(Rule):
    """Insert StreamingRepartition before each Zip input so blocks align."""

    TARGET_NUM_ROWS_PER_BLOCK = 128

medium

The TARGET_NUM_ROWS_PER_BLOCK is hardcoded to 128. This might not be optimal for all use cases. Consider making this value configurable, for example, through the DataContext. This would provide more flexibility for users to tune performance based on their specific data characteristics (e.g., row size).
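One way to act on this suggestion, sketched with a stubbed stand-in for ray.data.DataContext so it runs standalone (the field name zip_target_num_rows_per_block is hypothetical, not an existing Ray Data setting):

```python
from dataclasses import dataclass


@dataclass
class DataContext:
    """Stand-in for ray.data.DataContext; the field below is hypothetical."""

    zip_target_num_rows_per_block: int = 128  # proposed knob, not an existing setting


def target_rows_per_block(ctx: DataContext) -> int:
    # Resolve the repartition target from the context instead of a class constant.
    return ctx.zip_target_num_rows_per_block
```

The optimizer rule would then read the target from the active context at plan time, so users with very wide rows could lower it without patching the rule.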

@owenowenisme owenowenisme added the data (Ray Data-related issues) and go (add ONLY when ready to merge, run all tests) labels Nov 18, 2025
owenowenisme and others added 2 commits December 5, 2025 19:15
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>

# Conflicts:
#	python/ray/data/_internal/logical/optimizers.py

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
@alexeykudinkin alexeykudinkin force-pushed the data/make-zip-op-streaming branch from 39f042f to 40df837 on December 6, 2025 03:41
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
@owenowenisme owenowenisme force-pushed the data/make-zip-op-streaming branch from 96b5865 to c1a49e5 on January 2, 2026 04:15
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
self._input_buffers[input_index].append(refs)
assert 0 <= input_index < len(self._input_dependencies), input_index
self._pending_bundles[input_index].append(refs)
self._pending_rows[input_index] += refs.num_rows()

TypeError when bundle has unknown row count

RefBundle.num_rows() returns Optional[int] and can return None when block metadata doesn't have a row count. The code at line 131 performs self._pending_rows[input_index] += refs.num_rows() which raises a TypeError if num_rows() is None. This propagates to line 186 where min(self._pending_rows) fails with mixed int/None values, and line 243 where bundle_rows <= rows_remaining comparison fails. Several data sources like BigQuery, Delta Sharing, and Hudi legitimately set num_rows=None in metadata. The previous bulk implementation handled this by fetching row counts remotely, but the new streaming implementation lacks this safeguard.
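One defensive option is to validate row counts at enqueue time rather than letting None surface later as a TypeError. A minimal sketch, with RefBundle stubbed and the helper name add_pending_rows illustrative rather than a Ray internal:

```python
from typing import List, Optional


class RefBundle:
    """Minimal stand-in for Ray Data's RefBundle; real metadata is richer."""

    def __init__(self, rows: Optional[int]):
        self._rows = rows

    def num_rows(self) -> Optional[int]:
        return self._rows


def add_pending_rows(pending_rows: List[int], input_index: int, bundle: RefBundle) -> None:
    rows = bundle.num_rows()
    if rows is None:
        # Bundles without row metadata (e.g. some BigQuery/Hudi sources) would
        # otherwise fail later inside min() or the row-budget comparisons.
        raise ValueError("streaming zip requires bundles with known row counts")
    pending_rows[input_index] += rows
```

Alternatively, the operator could fall back to fetching row counts remotely, as the previous bulk implementation did.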


Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
for bundle in consumed:
    self._metrics.on_input_dequeued(bundle)
    total_size_bytes += bundle.size_bytes()
    owns_blocks = owns_blocks and bundle.owns_blocks

Output bundle size_bytes is zero when slicing occurs

The total_size_bytes calculation only counts bundles that are fully consumed, but when bundles are sliced (split across multiple tasks), the consumed list is empty until the final slice is used. This means output bundles created from sliced data will have size_bytes=0 in their metadata, while the final task gets the entire original bundle's size. This causes inaccurate memory accounting despite implements_accurate_memory_accounting() returning True. The fix is to calculate size from merged_bundles (which contains the actual processed data) instead of from consumed bundles.
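The proposed fix can be sketched as follows, with Bundle a minimal stand-in for a RefBundle carrying byte-size metadata and output_size_bytes an illustrative helper name:

```python
class Bundle:
    """Minimal stand-in for a RefBundle with byte-size metadata."""

    def __init__(self, size_bytes: int):
        self._size_bytes = size_bytes

    def size_bytes(self) -> int:
        return self._size_bytes


def output_size_bytes(merged_bundles) -> int:
    # Summing over the bundles actually merged into the task (slices included)
    # avoids reporting size_bytes=0 when no input bundle was consumed whole.
    return sum(b.size_bytes() for b in merged_bundles)
```

Because merged_bundles always reflects the data the task processed, each output's size stays proportional to its contents instead of lurching between 0 and the full input size.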


for bundle in consumed:
    self._metrics.on_input_dequeued(bundle)
    total_size_bytes += bundle.size_bytes()
    owns_blocks = owns_blocks and bundle.owns_blocks

Output bundle ownership incorrectly based on input bundles

The output bundle's owns_blocks is derived from consumed input bundles rather than recognizing that the remote task creates a new block. When input bundles don't own their blocks (e.g., shared bundles) and are consumed whole without slicing, owns_blocks becomes False for the output. However, the output contains a newly created block from _zip_blocks_with_slices that has no other references, so it should always be owned. This could prevent eager memory cleanup when destroy_if_owned is called, since the output incorrectly believes it doesn't own its block.
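The suggested behavior can be sketched like this, with OutputBundle and make_zip_output simplified illustrative stand-ins for the operator's real output construction:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OutputBundle:
    """Simplified stand-in for the operator's output RefBundle."""

    blocks: List[str]
    owns_blocks: bool


def make_zip_output(new_blocks: List[str]) -> OutputBundle:
    # The zip task materialized a fresh block with no other references, so the
    # output can always claim ownership, independent of input ownership. This
    # keeps destroy_if_owned eligible for eager memory cleanup.
    return OutputBundle(blocks=new_blocks, owns_blocks=True)
```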


@github-actions

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jan 22, 2026
@owenowenisme owenowenisme added unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it. and removed stale The issue is stale. It will be closed within 7 days unless there are further conversation labels Jan 25, 2026


Development

Successfully merging this pull request may close these issues.

[Data] Make Zip a properly streaming operator

2 participants