[Data] - Improve performance for `unify_schemas` by goutamvenkat-anyscale · Pull Request #55880 · ray-project/ray

goutamvenkat-anyscale · 2025-08-25T09:33:44Z

Why are these changes needed?

Find all diverging schemas, coalesce them if possible, and do so recursively in the presence of structs.
Perform a single pass to gather stats for all columns across all schemas.
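
A minimal sketch of the single-pass idea, under stated assumptions: schemas are modeled as plain `dict`s of column name to type name rather than PyArrow schemas, the `ColAgg` name is borrowed from the review below, and its fields and both helper functions are hypothetical. It illustrates the shape of the approach and is not the code in this PR.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Mapping, Set

# Hypothetical stand-in: a "schema" is a mapping of column name -> type name.
Schema = Mapping[str, str]


@dataclass
class ColAgg:
    """Per-column statistics collected while scanning every schema once."""

    types: Set[str] = field(default_factory=set)  # distinct types seen
    seen_in: int = 0  # number of schemas that contain the column


def gather_column_stats(schemas: List[Schema]) -> Dict[str, ColAgg]:
    """Visit each (schema, column) pair exactly once."""
    stats: Dict[str, ColAgg] = {}
    for schema in schemas:
        for name, type_name in schema.items():
            agg = stats.setdefault(name, ColAgg())
            agg.types.add(type_name)
            agg.seen_in += 1
    return stats


def diverging_columns(stats: Dict[str, ColAgg], num_schemas: int) -> List[str]:
    """Columns whose type differs across schemas or that are missing from some."""
    return [
        name
        for name, agg in stats.items()
        if len(agg.types) > 1 or agg.seen_in < num_schemas
    ]


example_schemas = [
    {"id": "int64", "name": "string"},
    {"id": "int64", "name": "null"},
    {"id": "int64"},
]
example_stats = gather_column_stats(example_schemas)
example_diverging = diverging_columns(example_stats, len(example_schemas))
```

Only the columns reported as diverging would then need reconciliation; columns that are identical everywhere can be passed through untouched.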

Related issue number

Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
  - [ ] I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in `doc/source/tune/api/` under the
    corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [x] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

Signed-off-by: Goutam V <goutam@anyscale.com>

gemini-code-assist

Code Review

This pull request significantly improves the performance of unify_schemas by refactoring it to use a single pass for gathering column statistics. The new implementation is not only faster but also more readable and maintainable. The use of a ColAgg dataclass to hold column statistics is a clean approach. I've found one potential issue with override precedence that could lead to incorrect type unification in some cases. Otherwise, this is an excellent improvement.


srinathk10

It may be good to add a `unify_schemas` test case covering many schemas (10) and wide schemas (10k), with CI assuming it all gets done in < 1 sec.
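
One way such a check could look, as a sketch only: `make_schemas`, `toy_unify`, and `assert_unifies_within` are hypothetical names, schemas are modeled as plain `dict`s, and `toy_unify` is a placeholder for the real `unify_schemas`, which is not called here. The sizes (10 schemas, 10k columns) and the 1-second budget come from the comment above.

```python
import time
from typing import Callable, Dict, List

Schema = Dict[str, str]  # hypothetical stand-in: column name -> type name


def make_schemas(num_schemas: int, num_columns: int) -> List[Schema]:
    """Build wide schemas that agree everywhere except on the last column."""
    schemas = []
    for i in range(num_schemas):
        schema = {f"col_{c}": "int64" for c in range(num_columns)}
        # Alternate the last column's type so unification has real work to do.
        schema[f"col_{num_columns - 1}"] = "int64" if i % 2 == 0 else "float64"
        schemas.append(schema)
    return schemas


def toy_unify(schemas: List[Schema]) -> Schema:
    """Placeholder unifier: widens int64 to float64 when the two conflict."""
    unified: Schema = {}
    for schema in schemas:
        for name, type_name in schema.items():
            prev = unified.get(name)
            if prev is None or prev == type_name:
                unified[name] = type_name
            elif {prev, type_name} == {"int64", "float64"}:
                unified[name] = "float64"
            else:
                raise TypeError(f"cannot unify {prev} and {type_name} for {name}")
    return unified


def assert_unifies_within(
    unify: Callable[[List[Schema]], Schema],
    schemas: List[Schema],
    budget_s: float,
) -> Schema:
    """Run the unifier once and fail if it exceeds the time budget."""
    start = time.perf_counter()
    result = unify(schemas)
    elapsed = time.perf_counter() - start
    assert elapsed < budget_s, f"unification took {elapsed:.3f}s (budget {budget_s}s)"
    return result


unified = assert_unifies_within(toy_unify, make_schemas(10, 10_000), budget_s=1.0)
```

Wall-clock assertions can be flaky on shared CI machines, so a generous budget or a relative comparison against a baseline may be safer than a tight absolute bound.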


alexeykudinkin · 2025-09-08T18:22:06Z

python/ray/data/_internal/arrow_ops/transform_pyarrow.py

+    schemas[0].remove_metadata()
+    schemas_to_unify = [schemas[0]]
+    for schema in schemas[1:]:
+        schema.remove_metadata()
+        if not schema.equals(schemas[0]):


nit: Let's actually use a set here (later we can raise a PR in PyArrow to start caching the hashes).

I'll use dict.fromkeys() instead to preserve ordering.

Actually, Spark schemas are dicts, and they're unhashable. This fails test_raydp at `df = ds.to_spark(spark)`.

Input has to be PA schema, right?

If you look at this stack trace:

[2025-09-10T22:10:49Z] _____________________________ test_raydp_roundtrip _____________________________

spark = <pyspark.sql.session.SparkSession object at 0x7f086c7c2190>

    def test_raydp_roundtrip(spark):
        spark_df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["one", "two"])
        rows = [(r.one, r.two) for r in spark_df.take(3)]
        ds = ray.data.from_spark(spark_df)
        values = [(r["one"], r["two"]) for r in ds.take(6)]
        assert values == rows
>       df = ds.to_spark(spark)

python/ray/data/tests/test_raydp.py:30:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/rayci/python/ray/data/dataset.py:5594: in to_spark
    schema = self.schema()
/rayci/python/ray/data/dataset.py:3459: in schema
    base_schema = self._plan.schema(fetch_if_missing=False)
/rayci/python/ray/data/_internal/plan.py:395: in schema
    schema = self._logical_plan.dag.infer_schema()
/rayci/python/ray/data/_internal/logical/operators/from_operators.py:77: in infer_schema
    return unify_ref_bundles_schema(self._input_data)
/rayci/python/ray/data/_internal/util.py:791: in unify_ref_bundles_schema
    return unify_schemas_with_validation(schemas_to_unify)
/rayci/python/ray/data/_internal/util.py:775: in unify_schemas_with_validation
    return unify_schemas(schemas_to_unify, promote_types=True)
/rayci/python/ray/data/_internal/arrow_ops/transform_pyarrow.py:325: in unify_schemas
    schemas_to_unify = list(dict.fromkeys(schemas))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   TypeError: unhashable type: 'dict'

It seems that the schema becomes a dict; `infer_schema()` seems to be the one that converts it.
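
The failure above is easy to reproduce in isolation. The sketch below shows order-preserving de-duplication with `dict.fromkeys()` and why it breaks for dict-shaped schemas. The equality-based fallback and the `dedupe_preserving_order` name are assumptions for illustration; this thread does not show how the PR ultimately handled the case.

```python
from typing import Any, List


def dedupe_preserving_order(items: List[Any]) -> List[Any]:
    """Drop repeated items while keeping first-seen order."""
    try:
        # Fast path: dict keys are unique and keep insertion order, O(n).
        return list(dict.fromkeys(items))
    except TypeError:
        # Unhashable items (e.g. dicts) cannot be dict keys, so fall back
        # to an O(n^2) scan that relies on equality only.
        unique: List[Any] = []
        for item in items:
            if item not in unique:
                unique.append(item)
        return unique


# Hashable inputs take the fast path.
hashable_result = dedupe_preserving_order(["b", "a", "b"])

# Dict-shaped schemas raise "TypeError: unhashable type: 'dict'" inside
# dict.fromkeys(), so they take the fallback path.
dict_schemas = [{"one": "long"}, {"one": "long"}, {"two": "string"}]
dict_result = dedupe_preserving_order(dict_schemas)
```

A set would also remove duplicates but does not preserve order, which is why `dict.fromkeys()` was proposed instead.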


goutamvenkat-anyscale · 2025-09-11T01:40:39Z

python/ray/data/_internal/arrow_ops/transform_pyarrow.py

+        # If we raise only on non tensor errors, it fails to unify PythonObjectType and pyarrow primitives.
+        # Look at test_pyarrow_conversion_error_handling for an example.


@alexeykudinkin just fyi

Ack. What do exceptions look like in these cases?

I want to limit the scope of it as much as possible

pyarrow.lib.ArrowTypeError: Unable to merge: Field my_data has incompatible types: string vs extension<ray.data.arrow_pickled_object<ArrowPythonObjectType>>


alexeykudinkin

LGTM, minor comments

alexeykudinkin · 2025-09-11T18:36:18Z

python/ray/data/_internal/arrow_ops/transform_pyarrow.py

+            if not (pyarrow.types.is_list(t) and pyarrow.types.is_null(t.value_type)):
+                return t
+    # Let PyArrow handle other cases
+    return None


What does this mean?

At this phase, it will error out because Arrow can't handle the case and we can't reconcile either. I'll clarify the comment.


- Commit c795440: [Data] - Improve performance for
- goutamvenkat-anyscale requested a review from a team as a code owner, August 25, 2025 09:33
- gemini-code-assist bot reviewed, Aug 25, 2025 (1 comment on python/ray/data/_internal/arrow_ops/transform_pyarrow.py, outdated)
- Commit 6e85864: fix import
- ray-gardener bot added the "data" (Ray Data-related issues) label, Aug 25, 2025
- Commit c5b6e81: Clean up comment
- goutamvenkat-anyscale added the "go" (add ONLY when ready to merge, run all tests) label, Aug 25, 2025
- goutamvenkat-anyscale added 2 commits, August 25, 2025 13:00
  - b019e33: tfrecords test
  - 07842dc: try tf record fix again
- alexeykudinkin reviewed, Aug 25, 2025 (1 comment on transform_pyarrow.py, outdated)
- goutamvenkat-anyscale added 2 commits, August 26, 2025 01:44
  - 8d476ef: Diverging schema approach
  - a7ab309: Clean up
- alexeykudinkin reviewed, Aug 26, 2025 (2 comments on transform_pyarrow.py, outdated)
- goutamvenkat-anyscale added 4 commits, August 26, 2025 15:24
  - 4d2fc31: do traversal + coalescing in 2 separate steps
  - 2d2a2cb: Keep linter happy
  - d84110e: Cleanup
  - 01733f6: One more clean up
- goutamvenkat-anyscale requested a review from a team as a code owner, August 27, 2025 20:25
- goutamvenkat-anyscale added 2 commits, August 27, 2025 16:35
  - 2914cbf: Merge from master
  - d74d6b2: Merge from master
- srinathk10 reviewed, Aug 29, 2025 (2 comments on transform_pyarrow.py, outdated)
- alexeykudinkin reviewed, Aug 29, 2025 (1 comment on transform_pyarrow.py, outdated)
- Commit 01ae9ec: Address comments
- alexeykudinkin reviewed, Aug 30, 2025
- Commit 630329b: Clean up + Address comments
- iamjustinhsu reviewed, Sep 3, 2025 (1 comment on transform_pyarrow.py, outdated)
- goutamvenkat-anyscale added 4 commits, September 3, 2025 13:52
  - b988f9f: Merge branch 'master' into goutam/fix_schema_unification
  - 723f4ca: Merge from master + merge conflicts
  - dc0a969: One simplification
  - 74a48e4: doclint
- goutamvenkat-anyscale added 4 commits, September 4, 2025 18:17
  - 72cfc25: Address some comments
  - 1d4cdc6: one more cleanup
  - 9b645a6: try catch
  - ae8a776: Include more errors
- alexeykudinkin reviewed, Sep 9, 2025
- goutamvenkat-anyscale added 7 commits, September 10, 2025 12:32
  - f97d7c6: Reconciliation and divergence in one pass
  - 02c80c7: Cleanup
  - 5eaae8e: Use list(dict.from_keys)
  - 73a6213: Merge branch 'master' into goutam/fix_schema_unification
  - 91ad0f1: spark schemas are dicts - unhashable
  - a8023f5: Have to handle null list fields for pyarrow = 9
  - 07b5ecd: Fix test
- goutamvenkat-anyscale commented, Sep 11, 2025
- alexeykudinkin reviewed, Sep 11, 2025
- goutamvenkat-anyscale added 4 commits, September 11, 2025 12:37
  - 32f5ba6: Fix + Address comments
  - a475a97: One more clean up
  - d00bb1f: tiny cleanup
  - 0935ac2: pytest skip
- alexeykudinkin approved these changes, Sep 12, 2025
- Commit 94f895c: Add cache for field
- alexeykudinkin approved these changes, Sep 12, 2025
- alexeykudinkin enabled auto-merge (squash), September 12, 2025 05:27
- alexeykudinkin merged commit df5951e into ray-project:master, Sep 12, 2025 (6 checks passed)
- goutamvenkat-anyscale deleted the goutam/fix_schema_unification branch, September 12, 2025 17:36

Each commit above is signed off by Goutam V <goutam@anyscale.com> (written "Goutam V." from 723f4ca onward), except the merge commits b988f9f and 73a6213, which show no sign-off line.
