[Data][Serialization] - Use Arrow IPC for Arrow Schema Serdes by goutamvenkat-anyscale · Pull Request #60195 · ray-project/ray

goutamvenkat-anyscale · 2026-01-16T01:40:52Z

Description

Ray Core was using cloudpickle instead of Arrow IPC as the serialization mechanism, and that takes a toll on Ray Data since Ray Data keeps track of the materialized BlockMetadata on the driver.

Benchmark Results:

============================================================
Schema: Large Schema (100 fields, mixed types) (100 fields)
Iterations: 10,000
============================================================

Method	Serialize (ms)	Deserialize (ms)	Size (bytes)
Arrow IPC	238.65	667.79	6856
cloudpickle	5985.70	1633.24	4710
ray.cloudpickle (with Cloudpickle)	6036.41	1623.84	4710
ray.cloudpickle (with Arrow IPC)	308.53	706.48	6942

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

Signed-off-by: Goutam <goutam@anyscale.com>

gemini-code-assist

Code Review

This pull request introduces a performance optimization for pyarrow.Schema serialization by using Arrow's native IPC format instead of cloudpickle. The implementation is clean and is accompanied by a comprehensive test case that validates the new serialization mechanism.

I have one suggestion to further improve performance by avoiding an unnecessary memory copy during serialization. This involves using pyarrow.Buffer directly instead of converting to a bytes object. I've left detailed comments on the implementation and the corresponding test file.

gemini-code-assist · 2026-01-16T01:42:00Z

python/ray/_private/arrow_serialization.py

+    return _restore_schema_from_ipc, (schema.serialize().to_pybytes(),)
+
+
+def _restore_schema_from_ipc(buf: bytes) -> "pyarrow.Schema":
+    """Restore an Arrow Schema serialized to Arrow IPC format."""
+    import pyarrow as pa
+
+    return pa.ipc.read_schema(pa.BufferReader(buf))


To further improve performance and avoid an unnecessary memory copy, you can directly use the pyarrow.Buffer returned by schema.serialize() instead of converting it to a Python bytes object. pyarrow.Buffer objects are picklable and can be transferred efficiently.

This involves two small changes:

In _arrow_schema_reduce, remove the .to_pybytes() call.

In _restore_schema_from_ipc, update the type hint for buf and remove the pa.BufferReader wrapper, as pa.ipc.read_schema can directly handle a pyarrow.Buffer.

Suggested change

return _restore_schema_from_ipc, (schema.serialize().to_pybytes(),)

def _restore_schema_from_ipc(buf: bytes) -> "pyarrow.Schema":

"""Restore an Arrow Schema serialized to Arrow IPC format."""

import pyarrow as pa

return pa.ipc.read_schema(pa.BufferReader(buf))

return _restore_schema_from_ipc, (schema.serialize(),)

def _restore_schema_from_ipc(buf: "pyarrow.Buffer") -> "pyarrow.Schema":

"""Restore an Arrow Schema serialized to Arrow IPC format."""

import pyarrow as pa

return pa.ipc.read_schema(buf)

============================================================
Schema: Large Schema (100 fields, mixed types) (100 fields)
Iterations: 10,000

Method Serialize (ms) Deserialize (ms) Size (bytes)

IPC 236.56 665.39 6856
cloudpickle 5939.44 1623.03 4710
ray.cloudpickle 359.02 706.08 6978

Speedup vs IPC (slower = Nx):
cloudpickle: Serialize: 25.11x Deserialize: 2.44x Size: 0.69x
ray.cloudpickle: Serialize: 1.52x Deserialize: 1.06x Size: 1.02x

============================================================

Ends up being worse with pa.Buffer

Interesting

python/ray/data/tests/test_arrow_serialization.py

Signed-off-by: Goutam <goutam@anyscale.com>

…oject#60195) ## Description Ray Core was using `cloudpickle` instead of `Arrow IPC` as the serialization mechanism, and that takes a toll on Ray Data since Ray Data keeps track of the materialized `BlockMetadata` on the driver. Benchmark Results: ``` ============================================================ Schema: Large Schema (100 fields, mixed types) (100 fields) Iterations: 10,000 ============================================================ ``` | Method | Serialize (ms) | Deserialize (ms) | Size (bytes) | |--------|----------------|------------------|--------------| | Arrow IPC | 238.65 | 667.79 | 6856 | | cloudpickle | 5985.70 | 1633.24 | 4710 | | ray.cloudpickle (with Cloudpickle) | 6036.41 | 1623.84 | 4710 | | ray.cloudpickle (with Arrow IPC) | 308.53 | 706.48 | 6942 | ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Goutam <goutam@anyscale.com> Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>

…oject#60195) ## Description Ray Core was using `cloudpickle` instead of `Arrow IPC` as the serialization mechanism, and that takes a toll on Ray Data since Ray Data keeps track of the materialized `BlockMetadata` on the driver. Benchmark Results: ``` ============================================================ Schema: Large Schema (100 fields, mixed types) (100 fields) Iterations: 10,000 ============================================================ ``` | Method | Serialize (ms) | Deserialize (ms) | Size (bytes) | |--------|----------------|------------------|--------------| | Arrow IPC | 238.65 | 667.79 | 6856 | | cloudpickle | 5985.70 | 1633.24 | 4710 | | ray.cloudpickle (with Cloudpickle) | 6036.41 | 1623.84 | 4710 | | ray.cloudpickle (with Arrow IPC) | 308.53 | 706.48 | 6942 | ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Goutam <goutam@anyscale.com> Signed-off-by: jinbum-kim <jinbum9958@gmail.com>

…oject#60195) ## Description Ray Core was using `cloudpickle` instead of `Arrow IPC` as the serialization mechanism, and that takes a toll on Ray Data since Ray Data keeps track of the materialized `BlockMetadata` on the driver. Benchmark Results: ``` ============================================================ Schema: Large Schema (100 fields, mixed types) (100 fields) Iterations: 10,000 ============================================================ ``` | Method | Serialize (ms) | Deserialize (ms) | Size (bytes) | |--------|----------------|------------------|--------------| | Arrow IPC | 238.65 | 667.79 | 6856 | | cloudpickle | 5985.70 | 1633.24 | 4710 | | ray.cloudpickle (with Cloudpickle) | 6036.41 | 1623.84 | 4710 | | ray.cloudpickle (with Arrow IPC) | 308.53 | 706.48 | 6942 | ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Goutam <goutam@anyscale.com>

…oject#60195) ## Description Ray Core was using `cloudpickle` instead of `Arrow IPC` as the serialization mechanism, and that takes a toll on Ray Data since Ray Data keeps track of the materialized `BlockMetadata` on the driver. Benchmark Results: ``` ============================================================ Schema: Large Schema (100 fields, mixed types) (100 fields) Iterations: 10,000 ============================================================ ``` | Method | Serialize (ms) | Deserialize (ms) | Size (bytes) | |--------|----------------|------------------|--------------| | Arrow IPC | 238.65 | 667.79 | 6856 | | cloudpickle | 5985.70 | 1633.24 | 4710 | | ray.cloudpickle (with Cloudpickle) | 6036.41 | 1623.84 | 4710 | | ray.cloudpickle (with Arrow IPC) | 308.53 | 706.48 | 6942 | ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Goutam <goutam@anyscale.com> Signed-off-by: peterxcli <peterxcli@gmail.com>

[Data][Serialization] - Use Arrow IPC for Arrow Schema Serdes

0f415f0

Signed-off-by: Goutam <goutam@anyscale.com>

goutamvenkat-anyscale requested review from a team as code owners January 16, 2026 01:40

gemini-code-assist bot reviewed Jan 16, 2026

View reviewed changes

goutamvenkat-anyscale added data Ray Data-related issues core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests labels Jan 16, 2026

Add type annotation

00b81b4

Signed-off-by: Goutam <goutam@anyscale.com>

alexeykudinkin approved these changes Jan 16, 2026

View reviewed changes

alexeykudinkin enabled auto-merge (squash) January 16, 2026 18:22

israbbani approved these changes Jan 16, 2026

View reviewed changes

alexeykudinkin merged commit 69ad7b9 into ray-project:master Jan 16, 2026
7 checks passed

goutamvenkat-anyscale deleted the goutam/arrow_schema_speedup branch January 16, 2026 19:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Data][Serialization] - Use Arrow IPC for Arrow Schema Serdes#60195

[Data][Serialization] - Use Arrow IPC for Arrow Schema Serdes#60195
alexeykudinkin merged 2 commits intoray-project:masterfrom
goutamvenkat-anyscale:goutam/arrow_schema_speedup

goutamvenkat-anyscale commented Jan 16, 2026 •

edited

Loading

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Jan 16, 2026

Uh oh!

goutamvenkat-anyscale Jan 16, 2026

Uh oh!

alexeykudinkin Jan 16, 2026

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

goutamvenkat-anyscale commented Jan 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Related issues

Additional information

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Jan 16, 2026

Choose a reason for hiding this comment

Uh oh!

goutamvenkat-anyscale Jan 16, 2026

Choose a reason for hiding this comment

============================================================ Schema: Large Schema (100 fields, mixed types) (100 fields) Iterations: 10,000

Method Serialize (ms) Deserialize (ms) Size (bytes)

Uh oh!

alexeykudinkin Jan 16, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

goutamvenkat-anyscale commented Jan 16, 2026 •

edited

Loading

============================================================
Schema: Large Schema (100 fields, mixed types) (100 fields)
Iterations: 10,000