Skip to content

[data] Cherry pick data fixes for 2.49.1#56058

Merged
aslonnie merged 2 commits intoray-project:releases/2.49.1from
omatthew98:mowen/cherry-pick-data-fixes
Aug 28, 2025
Merged

[data] Cherry pick data fixes for 2.49.1#56058
aslonnie merged 2 commits intoray-project:releases/2.49.1from
omatthew98:mowen/cherry-pick-data-fixes

Conversation

@omatthew98
Copy link
Copy Markdown
Contributor

@omatthew98 omatthew98 commented Aug 28, 2025

Why are these changes needed?

Cherry pick two fixes for ray data (from #55854 and #55926).

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces several fixes and performance improvements for Ray Data, primarily around schema unification. Key changes include adding __hash__ methods to Arrow extension types to enable a fast path for unify_schemas when all schemas are identical, and replacing expensive schema unification calls with a more lightweight _take_first_non_empty_schema where appropriate. A new performance test suite for unify_schemas has also been added.

My review focuses on a few areas:

  • The removal of a test for duplicate schema fields, which could be a potential regression.
  • The use of a broad except Exception block which could be narrowed.
  • A minor style point about an inline import.

Overall, the changes look solid and should provide a good performance boost.

@omatthew98 omatthew98 requested a review from aslonnie August 28, 2025 20:25
iamjustinhsu and others added 2 commits August 28, 2025 13:55
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?
results: TBD
<!-- Please give a short summary of the change and the problem this
solves. -->
- 2180 columns, 1000 schemas deduping is 1 second

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
Signed-off-by: Matthew Owen <mowen@anyscale.com>
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

- unification of schemas is slow
- revert back to pre https://github.com/ray-project/ray/pull/53454/files
commit. if no unification before, no unification after. if unification
before, we can leave it there or add it back. If I removed it I added a
comment with `# NOTE`
<!-- Please give a short summary of the change and the problem this
solves. -->

<!-- For example: "Closes ray-project#1234" -->

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
Signed-off-by: Matthew Owen <mowen@anyscale.com>
@omatthew98 omatthew98 force-pushed the mowen/cherry-pick-data-fixes branch from 47761ab to eb0973d Compare August 28, 2025 20:55
@omatthew98 omatthew98 added the go add ONLY when ready to merge, run all tests label Aug 28, 2025
@aslonnie aslonnie merged commit c057f1e into ray-project:releases/2.49.1 Aug 28, 2025
5 of 6 checks passed
@fog-ketian
Copy link
Copy Markdown

FYI - In case anyone else saw tons of warnings like this:

WARNING transform_pyarrow.py:181 -- Failed to hash the schemas (for deduplication): unhashable type: 'dict'

when using Ray data to load Parquet datasets with schema metadata after upgrading to Ray 2.49.1.

It's because PyArrow is sending the schema metadata dict to the hash func directly.
There is a PR to fix this issue on arrow side: apache/arrow#47601.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants