
[Data] Fix checkpoint filter PyArrow zero-copy conversion error #59839

Merged

bveeramani merged 1 commit into ray-project:master from dragongu:fix/checkpoint on Jan 7, 2026
Conversation

@dragongu
Contributor

@dragongu dragongu commented Jan 5, 2026

Fix ArrowInvalid error in checkpoint filter when converting PyArrow chunks to NumPy arrays

Issue

Fixes ArrowInvalid error when checkpoint filtering converts PyArrow chunks to NumPy arrays with zero_copy_only=True:

```
  File "/usr/local/lib/python3.10/dist-packages/ray/data/checkpoint/checkpoint_filter.py", line 249, in filter_rows_for_block
    masks = list(executor.map(filter_with_ckpt_chunk, ckpt_chunks))
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/data/checkpoint/checkpoint_filter.py", line 229, in filter_with_ckpt_chunk
    ckpt_ids = ckpt_chunk.to_numpy(zero_copy_only=True)
  File "pyarrow/array.pxi", line 1789, in pyarrow.lib.Array.to_numpy
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Needed to copy 1 chunks with 0 nulls, but zero_copy_only was True
```

This error occurs when checkpoint data is loaded from Ray's object store, where PyArrow buffers may reside in shared memory and cannot be zero-copied to NumPy.

Reproduction

```python
#!/usr/bin/env python3
import ray
from ray.data import DataContext
from ray.data.checkpoint import CheckpointConfig
import tempfile

ray.init()

with tempfile.TemporaryDirectory() as ckpt_dir, \
     tempfile.TemporaryDirectory() as data_dir, \
     tempfile.TemporaryDirectory() as output_dir:
    # Step 1: Create data
    ray.data.range(10).map(lambda x: {"id": f"id_{x['id']}"}).write_parquet(data_dir)

    # Step 2: Enable checkpointing and write
    ctx = DataContext.get_current()
    ctx.checkpoint_config = CheckpointConfig(
        checkpoint_path=ckpt_dir,
        id_column="id",
        delete_checkpoint_on_success=False
    )
    ray.data.read_parquet(data_dir).filter(lambda x: x["id"] != "id_0").write_parquet(output_dir)

    # Step 3: Second write triggers checkpoint filtering
    ray.data.read_parquet(data_dir).write_parquet(output_dir)

ray.shutdown()
```

Solution

Change to_numpy(zero_copy_only=True) to to_numpy(zero_copy_only=False) in BatchBasedCheckpointFilter.filter_rows_for_block(). This allows PyArrow to copy data when necessary.

Changes

File: ray/python/ray/data/checkpoint/checkpoint_filter.py

  • Line 229: Changed ckpt_chunk.to_numpy(zero_copy_only=True) to ckpt_chunk.to_numpy(zero_copy_only=False)

Performance Impact

No performance regression expected. PyArrow will only perform a copy when zero-copy is not possible.

@dragongu dragongu requested a review from a team as a code owner January 5, 2026 02:22
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly addresses the ArrowInvalid error that occurs during checkpoint filtering when a zero-copy conversion from a PyArrow array to a NumPy array is not possible. Changing zero_copy_only=True to False is the right solution, allowing PyArrow to perform a data copy when necessary. This ensures robustness without introducing performance regressions in cases where a zero-copy conversion is possible. I have one minor suggestion to improve code consistency.

Signed-off-by: dragongu <andrewgu@vip.qq.com>
@ray-gardener ray-gardener bot added the `data` (Ray Data-related issues) and `community-contribution` (Contributed by the community) labels Jan 5, 2026
Member

@owenowenisme owenowenisme left a comment


@dragongu Do you have a minimal repro script for this?

Thanks

@dragongu
Contributor Author

dragongu commented Jan 6, 2026

> @dragongu Do you have a minimal repro script for this?
>
> Thanks

@owenowenisme A reproduction script has been added.

@owenowenisme owenowenisme added the `go` (add ONLY when ready to merge, run all tests) label Jan 6, 2026
Member

@owenowenisme owenowenisme left a comment


LGTM

@bveeramani bveeramani merged commit c49b666 into ray-project:master Jan 7, 2026
7 checks passed
AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026
lee1258561 pushed a commit to pinterest/ray that referenced this pull request Feb 3, 2026
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Feb 3, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026