[data] ignore metadata for pandas block#56402

Merged
alexeykudinkin merged 5 commits into ray-project:master from iamjustinhsu:jhsu/fix-tensor-strings
Sep 11, 2025

Conversation

@iamjustinhsu
Contributor

@iamjustinhsu iamjustinhsu commented Sep 10, 2025

Why are these changes needed?

Consider the following code:

```python
import ray

# Read file (1)
source_path = "file_that_contains_tensor_strings.parquet"
ds = ray.data.read_parquet(source_path)

# Write file (2)
dest_path = "/tmp"
ds.map_batches(..., batch_format="pandas").write_parquet(dest_path)

# Read file again (3)
new_ds = ray.data.read_parquet(dest_path).map_batches(..., batch_format="pandas")
```

At a high level we read, write, then read again. At a lower level, we convert Arrow blocks -> pandas -> Arrow blocks -> pandas. We have connectors and registered extension types in `python/ray/air/util/tensor_extensions/`, but we special-case tensor types by converting them to `TensorArrays` here when we convert pandas -> arrow. During this conversion, pyarrow stores metadata about the pandas block that looks something like this:

```json
{
    "name": "feature1",
    "field_name": "feature1",
    "pandas_type": "object",
    "numpy_type": "numpy.ndarray(shape=(8, 2), dtype=<U38)",
    "metadata": null
},
{
    "name": "feature2",
    "field_name": "feature2",
    "pandas_type": "object",
    "numpy_type": "numpy.ndarray(shape=(8,), dtype=float32)",
    "metadata": null
}
```

For the most part this is fine. However, when converting _back_ to pandas, Arrow first consults the metadata (`"numpy_type"`) to restore the schema. This is troublesome because pandas/numpy don't know how to handle those custom dtype strings.

In pyarrow==14.0.0 this is an issue, because pyarrow surrenders the special casing to numpy/pandas.
In pyarrow==21.0.0 it is smarter and DOES handle that (I tested this).

NOTE

This fix has been tested with pyarrow==14.0.0 (Ray version 2.45), and it works.

Solution

To handle older pyarrow versions, we can pass `ignore_metadata=True`.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu force-pushed the jhsu/fix-tensor-strings branch from 7298367 to 1f7dcec Compare September 10, 2025 00:23
@iamjustinhsu iamjustinhsu added the go add ONLY when ready to merge, run all tests label Sep 10, 2025
@iamjustinhsu iamjustinhsu marked this pull request as ready for review September 10, 2025 19:18
@iamjustinhsu iamjustinhsu requested a review from a team as a code owner September 10, 2025 19:18
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu requested a review from a team as a code owner September 10, 2025 20:07
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
import pyarrow as pa

arrow_table = pa.Table.from_pandas(df_pandas)

# Convert back to pandas, skipping the stored pandas metadata
df_roundtrip = arrow_table.to_pandas(ignore_metadata=True)
Contributor Author


Confirmed this will fail without ignore_metadata=True. I wrote this test here to show that iamjustinhsu#3 will solve the issue.

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu force-pushed the jhsu/fix-tensor-strings branch from 52f2784 to f6fc010 Compare September 10, 2025 20:15
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

DEFAULT_ENABLE_PANDAS_BLOCK = True

DEFAULT_PANDAS_BLOCK_IGNORE_METADATA = bool(
Contributor


Let's use env_bool
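For reference, a minimal sketch of what an env_bool-style helper does (the helper body and the environment-variable name here are illustrative assumptions, not Ray's actual implementation):

```python
import os


def env_bool(key: str, default: bool) -> bool:
    # Interpret "1"/"true" (case-insensitive) as True; otherwise False.
    # Fall back to the default when the variable is unset.
    if key in os.environ:
        return os.environ[key].lower() in ("1", "true")
    return default


# Hypothetical flag name, for illustration only.
DEFAULT_PANDAS_BLOCK_IGNORE_METADATA = env_bool(
    "RAY_DATA_PANDAS_BLOCK_IGNORE_METADATA", True
)
print(DEFAULT_PANDAS_BLOCK_IGNORE_METADATA)
```

Using a helper like this keeps the default overridable at deploy time without a code change.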

@alexeykudinkin alexeykudinkin merged commit 0047e72 into ray-project:master Sep 11, 2025
5 checks passed
ZacAttack pushed a commit to ZacAttack/ray that referenced this pull request Sep 24, 2025
dstrodtman pushed a commit to dstrodtman/ray that referenced this pull request Oct 6, 2025
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
