[Cherry-pick] [Datasets] Support different number of blocks/rows per block in zip() (#32795) #32998

Merged
zcin merged 1 commit into ray-project:releases/2.3.1 from clarkzinzow:datasets/cherry-pick/zip
Mar 3, 2023

@clarkzinzow
Contributor

This PR cherry-picks #32795 onto the 2.3.1 release branch.

This PR adds support for a different number of blocks/rows per block in `ds1.zip(ds2)`, by aligning the blocks in `ds2` to `ds1` with a lightweight repartition/block splitting.

## Design

We rely heavily on the block splitting machinery that's used for `ds.split()` and `ds.split_at_indices()` to avoid an overly expensive repartition. Namely, for `ds1.zip(ds2)`, we:
1. Calculate the block sizes for `ds1` in order to get split indices.
2. Apply `_split_at_indices()` to `ds2` in order to get a list of `ds2` block chunks for every block in `ds1`, such that `self_block.num_rows() == sum(other_block.num_rows() for other_block in other_split_blocks)` for every `self_block` in `ds1`.
3. Zip together each block in `ds1` with the one or more blocks from `ds2` that constitute the block-aligned split for that `ds1` block.
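
The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in this PR: blocks are modeled as Python lists of rows, and the helper names (`split_indices_from_block_sizes`, `split_blocks_at_indices`, `zip_blocks`) are hypothetical stand-ins for the internal splitting machinery.

```python
from itertools import accumulate


def split_indices_from_block_sizes(block_sizes):
    # Step 1: cumulative row offsets at ds1's block boundaries.
    # The final offset (the total row count) is not a split point.
    return list(accumulate(block_sizes))[:-1]


def split_blocks_at_indices(blocks, indices):
    # Step 2: cut `blocks` at the global row `indices`, returning one list of
    # block chunks per split. Hypothetical stand-in for `_split_at_indices()`.
    splits = [[] for _ in range(len(indices) + 1)]
    bounds = list(indices) + [float("inf")]
    split_idx = 0
    offset = 0  # global row offset of the next unassigned row
    for block in blocks:
        start = 0
        while start < len(block):
            room = bounds[split_idx] - offset
            if room == 0:
                # Current split is full; move on to the next one.
                split_idx += 1
                continue
            take = min(room, len(block) - start)
            splits[split_idx].append(block[start:start + take])
            start += take
            offset += take
    return splits


def zip_blocks(ds1_blocks, ds2_blocks):
    sizes = [len(b) for b in ds1_blocks]
    if sum(sizes) != sum(len(b) for b in ds2_blocks):
        raise ValueError("Cannot zip datasets with different row counts.")
    indices = split_indices_from_block_sizes(sizes)
    ds2_splits = split_blocks_at_indices(ds2_blocks, indices)
    zipped = []
    for self_block, other_split_blocks in zip(ds1_blocks, ds2_splits):
        # Invariant from step 2: the chunks cover exactly this block's rows.
        assert len(self_block) == sum(len(b) for b in other_split_blocks)
        other_rows = [row for chunk in other_split_blocks for row in chunk]
        # Step 3: zip the ds1 block with its aligned ds2 rows.
        zipped.append(list(zip(self_block, other_rows)))
    return zipped


ds1 = [[1, 2, 3], [4, 5]]            # 2 blocks
ds2 = [["a"], ["b", "c", "d"], ["e"]]  # 3 blocks, different sizes
result = zip_blocks(ds1, ds2)
# result == [[(1, "a"), (2, "b"), (3, "c")], [(4, "d"), (5, "e")]]
```

The output keeps the block layout of `ds1`, and only the blocks of `ds2` that straddle a `ds1` boundary need to be split.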

@clarkzinzow clarkzinzow force-pushed the datasets/cherry-pick/zip branch from 6ba0275 to e03683d, March 3, 2023 16:24

Contributor

@zcin zcin left a comment

Tests look good - will ask for @zhe-thoughts approval as well

Contributor

@zhe-thoughts zhe-thoughts left a comment

Approved for cherry picking into 2.3.1
