[data] ignore metadata for pandas block#56402

Merged
alexeykudinkin merged 5 commits into ray-project:master from iamjustinhsu:jhsu/fix-tensor-strings
Sep 11, 2025

Conversation

@iamjustinhsu
Contributor

@iamjustinhsu iamjustinhsu commented Sep 10, 2025

Why are these changes needed?

Consider the following code:

```python
import ray

# Read file (1)
source_path = "file_that_contains_tensor_strings.parquet"
ds = ray.data.read_parquet(source_path)

# Write file (2)
dest_path = "/tmp"
ds.map_batches(..., batch_format="pandas").write_parquet(dest_path)

# Read file again (3)
new_ds = ray.data.read_parquet(dest_path).map_batches(..., batch_format="pandas")
```

At a high level we read, write, then read again. At a lower level, we convert Arrow blocks -> pandas -> Arrow blocks -> pandas. We have connectors and registered extension types in `python/ray/air/util/tensor_extensions/`, but we special-case tensor types by converting them to `TensorArrays` here when we convert pandas -> arrow. During this conversion, pyarrow stores metadata about the pandas block that looks something like this:

```json
{
    "name": "feature1",
    "field_name": "feature1",
    "pandas_type": "object",
    "numpy_type": "numpy.ndarray(shape=(8, 2), dtype=<U38)",
    "metadata": null
},
{
    "name": "feature2",
    "field_name": "feature2",
    "pandas_type": "object",
    "numpy_type": "numpy.ndarray(shape=(8,), dtype=float32)",
    "metadata": null
}
```

For the most part this is fine. However, when converting _back_ to pandas, Arrow first consults the metadata (`"numpy_type"`) to restore the schema. This is troublesome because pandas/numpy don't know how to handle those custom dtype strings.

In pyarrow==14.0.0 this is an issue, because pyarrow surrenders the special casing to numpy/pandas.
In pyarrow==21.0.0 it is smarter and DOES handle that (I tested this).

NOTE

This fix has been tested with pyarrow==14.0.0 (Ray version 2.45), and it works.

Solution

To handle older pyarrow versions, we can pass `ignore_metadata=True`.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu force-pushed the jhsu/fix-tensor-strings branch from 7298367 to 1f7dcec Compare September 10, 2025 00:23
@iamjustinhsu iamjustinhsu added the go add ONLY when ready to merge, run all tests label Sep 10, 2025
@iamjustinhsu iamjustinhsu marked this pull request as ready for review September 10, 2025 19:18
@iamjustinhsu iamjustinhsu requested a review from a team as a code owner September 10, 2025 19:18
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu requested a review from a team as a code owner September 10, 2025 20:07
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
import pyarrow as pa

arrow_table = pa.Table.from_pandas(df_pandas)

# Convert back to pandas, skipping the stored pandas metadata
df_roundtrip = arrow_table.to_pandas(ignore_metadata=True)
Contributor Author


Confirmed this will fail without ignore_metadata=True. I wrote this test here to show that iamjustinhsu#3 will solve the issue.

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu force-pushed the jhsu/fix-tensor-strings branch from 52f2784 to f6fc010 Compare September 10, 2025 20:15
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

DEFAULT_ENABLE_PANDAS_BLOCK = True

DEFAULT_PANDAS_BLOCK_IGNORE_METADATA = bool(
Contributor


Let's use env_bool
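For reference, a minimal sketch of what an env_bool-style helper does (the helper body and the environment-variable name here are illustrative assumptions, not Ray's actual implementation):

```python
import os


def env_bool(key: str, default: bool) -> bool:
    # Interpret "1"/"true" (case-insensitive) as True; otherwise False.
    # Fall back to the default when the variable is unset.
    if key in os.environ:
        return os.environ[key].lower() in ("1", "true")
    return default


# Hypothetical flag name, for illustration only.
DEFAULT_PANDAS_BLOCK_IGNORE_METADATA = env_bool(
    "RAY_DATA_PANDAS_BLOCK_IGNORE_METADATA", True
)
print(DEFAULT_PANDAS_BLOCK_IGNORE_METADATA)
```

Using a helper like this keeps the default overridable at deploy time without a code change.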

@alexeykudinkin alexeykudinkin merged commit 0047e72 into ray-project:master Sep 11, 2025
5 checks passed
ZacAttack pushed a commit to ZacAttack/ray that referenced this pull request Sep 24, 2025
dstrodtman pushed a commit to dstrodtman/ray that referenced this pull request Oct 6, 2025
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
