[Data] Fix split_blocks produce empty blocks by owenowenisme · Pull Request #57085 · ray-project/ray

owenowenisme · 2025-10-01T08:09:51Z

Why are these changes needed?

The _split_blocks function did not account for inputs that have fewer rows than the requested number of output blocks, so it produced many empty blocks. This change adds a guard that skips any split that would be empty.
A test for the new behavior is included.

This will help:

- Reduce unnecessary metadata transfer between operators.
- Free downstream operators from having to handle empty blocks.
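The guard described above can be sketched in plain Python. This is an illustrative sketch only, not the code in map_operator: the helper names split_sizes and split_rows are hypothetical, and a list stands in for a Ray Data block. It assumes chunk lengths are computed the way np.array_split computes them (the first n % k chunks get one extra row).

```python
from typing import Iterator, List, Sequence


def split_sizes(num_rows: int, num_splits: int) -> List[int]:
    """Chunk lengths for an even split (hypothetical helper).

    The first `num_rows % num_splits` chunks receive one extra row.
    """
    base, extra = divmod(num_rows, num_splits)
    return [base + 1] * extra + [base] * (num_splits - extra)


def split_rows(rows: Sequence, num_splits: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `rows`, skipping zero-length slices."""
    offset = 0
    for size in split_sizes(len(rows), num_splits):
        if size <= 0:
            # Guard: with more splits than rows, some sizes are 0.
            # Skipping them avoids yielding empty slices.
            continue
        yield rows[offset : offset + size]
        offset += size


print(split_sizes(1, 64).count(0))          # 63 zero-length sizes without the guard
print(list(split_rows([0], 64)))            # [[0]]
print(list(split_rows(list(range(5)), 3)))  # [[0, 1], [2, 3], [4]]
```

With one row and 64 requested splits, 63 of the computed sizes are zero, which mirrors the 63 empty blocks reported in this PR; the guard leaves a single non-empty slice.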

test script:

import ray

ds1 = ray.data.range(1)
print(ds1.materialize())

Before the fix: 64 blocks for a single row (which means the other 63 blocks are empty):

MaterializedDataset(num_blocks=64, num_rows=1, schema={id: int64})

After the fix:

MaterializedDataset(num_blocks=1, num_rows=1, schema={id: int64})

Related issue number

Closes #56879

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run pre-commit jobs to lint the changes in this PR. (pre-commit setup)
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Note

Avoids yielding empty blocks in _split_blocks, adjusts existing test expectations, and adds tests that validate correct block splitting.

Core:
- map_operator._split_blocks: Skip sizes <= 0 to avoid yielding empty block slices.
Tests:
- test_splitblocks.py: Add test_split_blocks validating _split_blocks matches np.array_split; import pa, BlockAccessor.
- test_consumption.py: Update empty dataset repr expectation from num_blocks=2 to num_blocks=1.
- test_operators.py: Adjust test_map_estimated_blocks_split to use 2-row input blocks so splitting actually occurs.
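The idea behind the new test, comparing the splitter's output with np.array_split once empty chunks are dropped, can be sketched as below. This is a hedged illustration rather than the test added in this PR: split_nonempty is a hypothetical pure-Python stand-in for the function under test, and lists stand in for blocks.

```python
import numpy as np


def split_nonempty(rows, num_splits):
    """Hypothetical stand-in for the splitter under test: even split, no empty chunks."""
    base, extra = divmod(len(rows), num_splits)
    sizes = [base + 1] * extra + [base] * (num_splits - extra)
    chunks, offset = [], 0
    for size in sizes:
        if size > 0:  # skip zero-length chunks
            chunks.append(rows[offset : offset + size])
        offset += size
    return chunks


def expected_chunks(rows, num_splits):
    """Reference result: np.array_split with the empty sub-arrays filtered out."""
    parts = np.array_split(np.asarray(rows, dtype=np.int64), num_splits)
    return [part.tolist() for part in parts if len(part) > 0]


def check_all(max_rows=6, max_splits=8):
    """Compare both implementations over a small grid; return the number of cases."""
    cases = 0
    for n in range(max_rows):
        for k in range(1, max_splits):
            rows = list(range(n))
            assert split_nonempty(rows, k) == expected_chunks(rows, k)
            cases += 1
    return cases


print(check_all())  # 42 (6 row counts x 7 split counts)
```

The grid deliberately includes cases where the split count exceeds the row count, since those are the inputs that used to produce empty blocks.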

Written by Cursor Bugbot for commit 13c2466. This will update automatically on new commits.

Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>

owenowenisme · 2025-10-02T03:55:28Z

@alexeykudinkin I think this PR is good to go. Would you mind merging it? Thanks!

owenowenisme added 2 commits October 1, 2025 08:07

update

46b6722

Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>

add test

f11d6b9

Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>

owenowenisme marked this pull request as ready for review October 1, 2025 11:22

owenowenisme requested a review from a team as a code owner October 1, 2025 11:22

owenowenisme changed the title from "[Data] Fix splitblock produce empty blocks" to "[Data] Fix split_blocks produce empty blocks" Oct 1, 2025

ray-gardener bot added the labels data (Ray Data-related issues) and community-contribution (Contributed by the community) Oct 1, 2025

alexeykudinkin approved these changes Oct 1, 2025

alexeykudinkin added the label go (add ONLY when ready to merge, run all tests) Oct 1, 2025

alexeykudinkin enabled auto-merge (squash) October 1, 2025 17:00

update test

eff1716

Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>

auto-merge was automatically disabled October 1, 2025 23:03
Head branch was pushed to by a user without write access

update test

13c2466

Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>

richardliaw merged commit 9218257 into ray-project:master Oct 3, 2025
6 checks passed

richardliaw merged 4 commits into ray-project:master from owenowenisme:data/fix-split-block-produce-empty-block
