Skip to content

[Data][Serialization] - Use Arrow IPC for Arrow Schema Serdes#60195

Merged
alexeykudinkin merged 2 commits intoray-project:masterfrom
goutamvenkat-anyscale:goutam/arrow_schema_speedup
Jan 16, 2026
Merged

[Data][Serialization] - Use Arrow IPC for Arrow Schema Serdes#60195
alexeykudinkin merged 2 commits intoray-project:masterfrom
goutamvenkat-anyscale:goutam/arrow_schema_speedup

Conversation

@goutamvenkat-anyscale
Copy link
Copy Markdown
Contributor

@goutamvenkat-anyscale goutamvenkat-anyscale commented Jan 16, 2026

Description

Ray Core was using cloudpickle instead of Arrow IPC as the serialization mechanism, and that takes a toll on Ray Data since Ray Data keeps track of the materialized BlockMetadata on the driver.

Benchmark Results:

============================================================
Schema: Large Schema (100 fields, mixed types) (100 fields)
Iterations: 10,000
============================================================
Method Serialize (ms) Deserialize (ms) Size (bytes)
Arrow IPC 238.65 667.79 6856
cloudpickle 5985.70 1633.24 4710
ray.cloudpickle (with Cloudpickle) 6036.41 1623.84 4710
ray.cloudpickle (with Arrow IPC) 308.53 706.48 6942

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

Signed-off-by: Goutam <goutam@anyscale.com>
@goutamvenkat-anyscale goutamvenkat-anyscale requested review from a team as code owners January 16, 2026 01:40
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a performance optimization for pyarrow.Schema serialization by using Arrow's native IPC format instead of cloudpickle. The implementation is clean and is accompanied by a comprehensive test case that validates the new serialization mechanism.

I have one suggestion to further improve performance by avoiding an unnecessary memory copy during serialization. This involves using pyarrow.Buffer directly instead of converting to a bytes object. I've left detailed comments on the implementation and the corresponding test file.

Comment on lines +122 to +129
return _restore_schema_from_ipc, (schema.serialize().to_pybytes(),)


def _restore_schema_from_ipc(buf: bytes) -> "pyarrow.Schema":
"""Restore an Arrow Schema serialized to Arrow IPC format."""
import pyarrow as pa

return pa.ipc.read_schema(pa.BufferReader(buf))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To further improve performance and avoid an unnecessary memory copy, you can directly use the pyarrow.Buffer returned by schema.serialize() instead of converting it to a Python bytes object. pyarrow.Buffer objects are picklable and can be transferred efficiently.

This involves two small changes:

  1. In _arrow_schema_reduce, remove the .to_pybytes() call.
  2. In _restore_schema_from_ipc, update the type hint for buf and remove the pa.BufferReader wrapper, as pa.ipc.read_schema can directly handle a pyarrow.Buffer.
Suggested change
return _restore_schema_from_ipc, (schema.serialize().to_pybytes(),)
def _restore_schema_from_ipc(buf: bytes) -> "pyarrow.Schema":
"""Restore an Arrow Schema serialized to Arrow IPC format."""
import pyarrow as pa
return pa.ipc.read_schema(pa.BufferReader(buf))
return _restore_schema_from_ipc, (schema.serialize(),)
def _restore_schema_from_ipc(buf: "pyarrow.Buffer") -> "pyarrow.Schema":
"""Restore an Arrow Schema serialized to Arrow IPC format."""
import pyarrow as pa
return pa.ipc.read_schema(buf)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

============================================================
Schema: Large Schema (100 fields, mixed types) (100 fields)
Iterations: 10,000

Method Serialize (ms) Deserialize (ms) Size (bytes)

IPC 236.56 665.39 6856
cloudpickle 5939.44 1623.03 4710
ray.cloudpickle 359.02 706.08 6978

Speedup vs IPC (slower = Nx):
cloudpickle: Serialize: 25.11x Deserialize: 2.44x Size: 0.69x
ray.cloudpickle: Serialize: 1.52x Deserialize: 1.06x Size: 1.02x

============================================================

Ends up being worse with pa.Buffer

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting

@goutamvenkat-anyscale goutamvenkat-anyscale added data Ray Data-related issues core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests labels Jan 16, 2026
Signed-off-by: Goutam <goutam@anyscale.com>
@alexeykudinkin alexeykudinkin enabled auto-merge (squash) January 16, 2026 18:22
@alexeykudinkin alexeykudinkin merged commit 69ad7b9 into ray-project:master Jan 16, 2026
7 checks passed
@goutamvenkat-anyscale goutamvenkat-anyscale deleted the goutam/arrow_schema_speedup branch January 16, 2026 19:39
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jan 18, 2026
…oject#60195)

## Description
Ray Core was using `cloudpickle` instead of `Arrow IPC` as the
serialization mechanism, and that takes a toll on Ray Data since Ray
Data keeps track of the materialized `BlockMetadata` on the driver.

Benchmark Results:

```
============================================================
Schema: Large Schema (100 fields, mixed types) (100 fields)
Iterations: 10,000
============================================================
```

| Method | Serialize (ms) | Deserialize (ms) | Size (bytes) |
|--------|----------------|------------------|--------------|
| Arrow IPC | 238.65 | 667.79 | 6856 |
| cloudpickle | 5985.70 | 1633.24  | 4710 |
| ray.cloudpickle (with Cloudpickle) | 6036.41 | 1623.84  | 4710 |
| ray.cloudpickle (with Arrow IPC) | 308.53 | 706.48 | 6942 |

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>
jinbum-kim pushed a commit to jinbum-kim/ray that referenced this pull request Jan 29, 2026
…oject#60195)

## Description
Ray Core was using `cloudpickle` instead of `Arrow IPC` as the
serialization mechanism, and that takes a toll on Ray Data since Ray
Data keeps track of the materialized `BlockMetadata` on the driver.

Benchmark Results:

```
============================================================
Schema: Large Schema (100 fields, mixed types) (100 fields)
Iterations: 10,000
============================================================
```

| Method | Serialize (ms) | Deserialize (ms) | Size (bytes) |
|--------|----------------|------------------|--------------|
| Arrow IPC | 238.65 | 667.79 | 6856 |
| cloudpickle | 5985.70 | 1633.24  | 4710 |
| ray.cloudpickle (with Cloudpickle) | 6036.41 | 1623.84  | 4710 |
| ray.cloudpickle (with Arrow IPC) | 308.53 | 706.48 | 6942 |

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: jinbum-kim <jinbum9958@gmail.com>
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Feb 3, 2026
…oject#60195)

## Description
Ray Core was using `cloudpickle` instead of `Arrow IPC` as the
serialization mechanism, and that takes a toll on Ray Data since Ray
Data keeps track of the materialized `BlockMetadata` on the driver.


Benchmark Results:

```
============================================================
Schema: Large Schema (100 fields, mixed types) (100 fields)
Iterations: 10,000
============================================================
```

| Method | Serialize (ms) | Deserialize (ms) | Size (bytes) |
|--------|----------------|------------------|--------------|
| Arrow IPC | 238.65 | 667.79 | 6856 |
| cloudpickle | 5985.70 | 1633.24  | 4710 |
| ray.cloudpickle (with Cloudpickle) | 6036.41 | 1623.84  | 4710 |
| ray.cloudpickle (with Arrow IPC) | 308.53 | 706.48 | 6942 |

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Goutam <goutam@anyscale.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…oject#60195)

## Description
Ray Core was using `cloudpickle` instead of `Arrow IPC` as the
serialization mechanism, and that takes a toll on Ray Data since Ray
Data keeps track of the materialized `BlockMetadata` on the driver.

Benchmark Results:

```
============================================================
Schema: Large Schema (100 fields, mixed types) (100 fields)
Iterations: 10,000
============================================================
```

| Method | Serialize (ms) | Deserialize (ms) | Size (bytes) |
|--------|----------------|------------------|--------------|
| Arrow IPC | 238.65 | 667.79 | 6856 |
| cloudpickle | 5985.70 | 1633.24  | 4710 |
| ray.cloudpickle (with Cloudpickle) | 6036.41 | 1623.84  | 4710 |
| ray.cloudpickle (with Arrow IPC) | 308.53 | 706.48 | 6942 |

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…oject#60195)

## Description
Ray Core was using `cloudpickle` instead of `Arrow IPC` as the
serialization mechanism, and that takes a toll on Ray Data since Ray
Data keeps track of the materialized `BlockMetadata` on the driver.

Benchmark Results:

```
============================================================
Schema: Large Schema (100 fields, mixed types) (100 fields)
Iterations: 10,000
============================================================
```

| Method | Serialize (ms) | Deserialize (ms) | Size (bytes) |
|--------|----------------|------------------|--------------|
| Arrow IPC | 238.65 | 667.79 | 6856 |
| cloudpickle | 5985.70 | 1633.24  | 4710 |
| ray.cloudpickle (with Cloudpickle) | 6036.41 | 1623.84  | 4710 |
| ray.cloudpickle (with Arrow IPC) | 308.53 | 706.48 | 6942 |

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Issues that should be addressed in Ray Core data Ray Data-related issues go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ray fails to serialize self-reference objects

3 participants