[data] Cherry pick data fixes for 2.49.1 by omatthew98 · Pull Request #56058 · ray-project/ray

omatthew98 · 2025-08-28T20:23:45Z

Why are these changes needed?

Cherry-pick two fixes for Ray Data (from #55854 and #55926).

Related issue number

Checks

I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

gemini-code-assist

Code Review

This pull request introduces several fixes and performance improvements for Ray Data, primarily around schema unification. Key changes include adding __hash__ methods to Arrow extension types to enable a fast path for unify_schemas when all schemas are identical, and replacing expensive schema unification calls with a more lightweight _take_first_non_empty_schema where appropriate. A new performance test suite for unify_schemas has also been added.

My review focuses on a few areas:

The removal of a test for duplicate schema fields, which could be a potential regression.
The use of a broad except Exception block which could be narrowed.
A minor style point about an inline import.

Overall, the changes look solid and should provide a good performance boost.
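The fast path described in the review summary can be pictured roughly as follows. This is a minimal pure-Python sketch, not Ray's implementation: `FakeExtensionType`, `FakeSchema`, and `unify_schemas_sketch` are hypothetical stand-ins for the Arrow extension types and `unify_schemas`, and the slow path is only a placeholder. It shows why a `__hash__` that agrees with `__eq__` is what makes hash-based deduplication possible.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


class FakeExtensionType:
    """Hypothetical stand-in for an Arrow extension type (not a PyArrow class)."""

    def __init__(self, name: str, shape: Tuple[int, ...]):
        self.name = name
        self.shape = shape

    def __eq__(self, other: object) -> bool:
        return (
            isinstance(other, FakeExtensionType)
            and (self.name, self.shape) == (other.name, other.shape)
        )

    # A class that defines __eq__ without __hash__ is unhashable in Python,
    # so __hash__ is defined explicitly and kept consistent with __eq__.
    def __hash__(self) -> int:
        return hash((self.name, self.shape))


@dataclass(frozen=True)
class FakeSchema:
    """Hypothetical schema: an ordered tuple of (column name, type) pairs."""

    fields: Tuple[Tuple[str, object], ...]


def unify_schemas_sketch(schemas: List[FakeSchema]) -> FakeSchema:
    if not schemas:
        raise ValueError("no schemas to unify")
    # Fast path: deduplicate by hash while preserving order. If only one
    # distinct schema remains, there is nothing to unify.
    distinct = list(dict.fromkeys(schemas))
    if len(distinct) == 1:
        return distinct[0]
    # Placeholder slow path (assumption): first-seen type wins per column.
    merged: Dict[str, object] = {}
    for schema in distinct:
        for name, typ in schema.fields:
            merged.setdefault(name, typ)
    return FakeSchema(tuple(merged.items()))
```

With many identical schemas, the deduplication step costs one hash per schema and the expensive merge is skipped entirely.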

## Why are these changes needed?

results: TBD
- 2180 columns, 1000 schemas deduping is 1 second

## Related issue number

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
    - [ ] Unit tests
    - [ ] Release tests
    - [ ] This PR is not tested :(

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
Signed-off-by: Matthew Owen <mowen@anyscale.com>

- unification of schemas is slow
- revert back to pre https://github.com/ray-project/ray/pull/53454/files commit. if no unification before, no unification after. if unification before, we can leave it there or add it back. If I removed it I added a comment with `# NOTE`

- [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
    - [ ] Unit tests
    - [ ] Release tests
    - [ ] This PR is not tested :(

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
Signed-off-by: Matthew Owen <mowen@anyscale.com>
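As a rough illustration of the lighter-weight alternative to full unification mentioned in the review summary (`_take_first_non_empty_schema`), the idea can be sketched as below. The function name, signature, and the representation of a schema as a sequence of column names are assumptions made for illustration; Ray's actual helper may differ.

```python
from typing import Iterable, Optional, Sequence


def take_first_non_empty_schema_sketch(
    schemas: Iterable[Optional[Sequence[str]]],
) -> Optional[Sequence[str]]:
    """Return the first schema that is not None and has at least one column.

    Hypothetical sketch: a schema is modeled as a sequence of column names.
    No merging is attempted, so the cost is a single pass that stops early.
    """
    for schema in schemas:
        if schema is not None and len(schema) > 0:
            return schema
    return None
```

The trade-off is that later schemas are never inspected, so this only fits call sites where no unification was performed before.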

fog-ketian · 2025-09-29T03:08:38Z

FYI, in case anyone else sees lots of warnings like the following when using Ray Data to load Parquet datasets with schema metadata after upgrading to Ray 2.49.1:

WARNING transform_pyarrow.py:181 -- Failed to hash the schemas (for deduplication): unhashable type: 'dict'

This happens because PyArrow passes the schema metadata dict directly to the hash function.
There is a PR to fix this issue on the Arrow side: apache/arrow#47601.
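The Python-level error behind the warning quoted above can be reproduced without PyArrow. The sketch below only illustrates why a plain `dict` cannot be hashed and one generic way to derive a hashable value from it; `hashable_metadata` is a hypothetical helper, not part of PyArrow or Ray, and it is not claimed to be how the linked Arrow PR fixes the issue.

```python
from typing import Dict, FrozenSet, Optional, Tuple


def hashable_metadata(
    metadata: Optional[Dict[bytes, bytes]],
) -> Optional[FrozenSet[Tuple[bytes, bytes]]]:
    """Hypothetical helper: turn a metadata dict into an order-independent,
    hashable frozenset of (key, value) pairs."""
    if metadata is None:
        return None
    return frozenset(metadata.items())


# Example metadata shaped like Parquet schema metadata (bytes keys and values).
metadata = {b"writer": b"example", b"version": b"1"}

try:
    hash(metadata)  # dicts are mutable, so Python refuses to hash them
    message = ""
except TypeError as exc:
    message = str(exc)

# The frozenset form hashes fine and ignores key insertion order.
reordered = {b"version": b"1", b"writer": b"example"}
same_hash = hash(hashable_metadata(metadata)) == hash(hashable_metadata(reordered))
```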

omatthew98 requested review from a team as code owners August 28, 2025 20:23

omatthew98 requested review from goutamvenkat-anyscale and iamjustinhsu August 28, 2025 20:25

gemini-code-assist bot reviewed Aug 28, 2025

omatthew98 requested a review from aslonnie August 28, 2025 20:25

iamjustinhsu and others added 2 commits August 28, 2025 13:55

omatthew98 force-pushed the mowen/cherry-pick-data-fixes branch from 47761ab to eb0973d Compare August 28, 2025 20:55

omatthew98 added the "go" label (add ONLY when ready to merge, run all tests) Aug 28, 2025

aslonnie merged commit c057f1e into ray-project:releases/2.49.1 Aug 28, 2025
5 of 6 checks passed

aslonnie merged 2 commits into ray-project:releases/2.49.1 from omatthew98:mowen/cherry-pick-data-fixes


4 participants
