[Data] Fix split_blocks produce empty blocks #57085

Merged
richardliaw merged 4 commits into ray-project:master from owenowenisme:data/fix-split-block-produce-empty-block
Oct 3, 2025

Conversation

@owenowenisme (Member) commented Oct 1, 2025

Why are these changes needed?

The split_blocks function didn't account for cases where the number of rows is smaller than the number of blocks, which resulted in many empty blocks. This change adds a guard to avoid splitting when that would produce empty blocks.
This PR also adds a test for the new behavior.

This will help:

  • Reduce unnecessary metadata transfer between operators
  • Free downstream operators from having to handle empty blocks

Test script:

import ray

ds1 = ray.data.range(1)
print(ds1.materialize())

Before the fix: 64 blocks for a single row (so the other 63 blocks are empty):

MaterializedDataset(num_blocks=64, num_rows=1, schema={id: int64})

After fix:

MaterializedDataset(num_blocks=1, num_rows=1, schema={id: int64})
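
The empty blocks follow directly from the arithmetic of an even split: dividing 1 row into 64 parts leaves 63 parts with nothing in them. The sketch below illustrates this with `np.array_split`, which the PR summary names as the reference that the new test compares against. It uses a plain NumPy array as a stand-in for a block and is not Ray's own code.

```python
import numpy as np

# One row, split 64 ways (the numbers from the example above).
rows = np.arange(1)
pieces = np.array_split(rows, 64)

# Count how many of the resulting pieces hold no rows at all.
sizes = [len(p) for p in pieces]
empty = sum(1 for s in sizes if s == 0)

print(len(pieces))  # 64
print(empty)        # 63
```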

Related issue number

Closes #56879

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run pre-commit jobs to lint the changes in this PR. (pre-commit setup)
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Note

Avoid yielding empty blocks in _split_blocks, adjust test expectations, and add tests to validate correct block splitting.

  • Core:
    • map_operator._split_blocks: Skip sizes <= 0 to avoid yielding empty block slices.
  • Tests:
    • test_splitblocks.py: Add test_split_blocks validating _split_blocks matches np.array_split; import pa, BlockAccessor.
    • test_consumption.py: Update empty dataset repr expectation from num_blocks=2 to num_blocks=1.
    • test_operators.py: Adjust test_map_estimated_blocks_split to use 2-row input blocks so splitting actually occurs.

Written by Cursor Bugbot for commit 13c2466.
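
The "skip sizes <= 0" guard described in the summary can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `map_operator._split_blocks`: the helper names `split_sizes` and `split_non_empty` are hypothetical, and a Python list stands in for a block. The sizes are computed the same way `np.array_split` distributes rows (the first `extra` pieces get one additional row).

```python
from typing import Iterator, List, Sequence


def split_sizes(num_rows: int, num_splits: int) -> List[int]:
    """Hypothetical helper: per-piece row counts for an even split."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    base, extra = divmod(num_rows, num_splits)
    # The first `extra` pieces take one additional row each.
    return [base + 1 if i < extra else base for i in range(num_splits)]


def split_non_empty(rows: Sequence, num_splits: int) -> Iterator[Sequence]:
    """Hypothetical helper: yield slices of `rows`, skipping empty ones."""
    offset = 0
    for size in split_sizes(len(rows), num_splits):
        if size <= 0:
            # The guard: a zero-sized slice would be an empty block.
            continue
        yield rows[offset:offset + size]
        offset += size


print(list(split_non_empty([0], 64)))            # [[0]]
print(list(split_non_empty(list(range(5)), 2)))  # [[0, 1, 2], [3, 4]]
```

When there are at least as many rows as requested splits, the guard never fires and the output has the same piece sizes as an even split. It only changes the result when rows are scarcer than splits.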

Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
@owenowenisme owenowenisme marked this pull request as ready for review October 1, 2025 11:22
@owenowenisme owenowenisme requested a review from a team as a code owner October 1, 2025 11:22
@owenowenisme owenowenisme changed the title [Data] Fix splitblock produce empty blocks [Data] Fix split_blocks produce empty blocks Oct 1, 2025
@ray-gardener ray-gardener bot added data Ray Data-related issues community-contribution Contributed by the community labels Oct 1, 2025
@alexeykudinkin alexeykudinkin added the go add ONLY when ready to merge, run all tests label Oct 1, 2025
@alexeykudinkin alexeykudinkin enabled auto-merge (squash) October 1, 2025 17:00
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
auto-merge was automatically disabled October 1, 2025 23:03

Head branch was pushed to by a user without write access

Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
@owenowenisme (Member, Author) commented:

@alexeykudinkin I think this pr is good to go, would you mind merging it? Thanks!

@richardliaw richardliaw merged commit 9218257 into ray-project:master Oct 3, 2025
6 checks passed
dstrodtman pushed a commit to dstrodtman/ray that referenced this pull request Oct 6, 2025
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
eicherseiji pushed a commit to eicherseiji/ray that referenced this pull request Oct 6, 2025
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
eicherseiji pushed a commit to eicherseiji/ray that referenced this pull request Oct 6, 2025
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
eicherseiji pushed a commit to eicherseiji/ray that referenced this pull request Oct 6, 2025
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
eicherseiji pushed a commit to eicherseiji/ray that referenced this pull request Oct 6, 2025
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
eicherseiji pushed a commit to eicherseiji/ray that referenced this pull request Oct 6, 2025
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
liulehui pushed a commit to liulehui/ray that referenced this pull request Oct 9, 2025
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
joshkodi pushed a commit to joshkodi/ray that referenced this pull request Oct 13, 2025
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
Signed-off-by: Josh Kodi <joshkodi@gmail.com>
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
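
The guard this PR describes (skip split sizes `<= 0` so that no empty block slices are yielded, while otherwise matching `np.array_split`) can be illustrated with a minimal, standalone sketch. This is not the code in `map_operator._split_blocks`: the names `split_sizes` and `split_rows` are hypothetical, plain Python sequences stand in for blocks, and the handling of a completely empty input is not modeled.

```python
from typing import Iterator, List, Sequence


def split_sizes(num_rows: int, num_splits: int) -> List[int]:
    """Sizes of `num_splits` near-equal chunks, divided the way np.array_split does.

    Hypothetical helper for illustration only; not part of Ray.
    """
    base, extra = divmod(num_rows, num_splits)
    # The first `extra` chunks get one additional row each.
    return [base + 1 if i < extra else base for i in range(num_splits)]


def split_rows(rows: Sequence, num_splits: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `rows`, skipping any slice that would be empty."""
    offset = 0
    for size in split_sizes(len(rows), num_splits):
        if size <= 0:
            # The guard: fewer rows than splits would otherwise yield empty slices.
            continue
        yield rows[offset : offset + size]
        offset += size


# One row requested as 64 splits yields a single non-empty slice.
print(len(list(split_rows([0], 64))))
# Five rows in three splits are divided as 2, 2, 1.
print(list(split_rows(list(range(5)), 3)))
```

With the guard in place, the number of slices produced is at most the number of rows, which is the property the `ray.data.range(1)` test script in the description relies on.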
Labels

- community-contribution: Contributed by the community
- data: Ray Data-related issues
- go: add ONLY when ready to merge, run all tests

Development

Successfully merging this pull request may close these issues.

[Data] Operator receives empty input blocks (BlockMetadata with num_rows=0)

4 participants