[data] Cherry pick data fixes for 2.49.1#56058
[data] Cherry pick data fixes for 2.49.1#56058aslonnie merged 2 commits intoray-project:releases/2.49.1from
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces several fixes and performance improvements for Ray Data, primarily around schema unification. Key changes include adding __hash__ methods to Arrow extension types to enable a fast path for unify_schemas when all schemas are identical, and replacing expensive schema unification calls with a more lightweight _take_first_non_empty_schema where appropriate. A new performance test suite for unify_schemas has also been added.
My review focuses on a few areas:
- The removal of a test for duplicate schema fields, which could be a potential regression.
- The use of a broad
except Exceptionblock which could be narrowed. - A minor style point about an inline import.
Overall, the changes look solid and should provide a good performance boost.
<!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? results: TBD <!-- Please give a short summary of the change and the problem this solves. --> - 2180 columns, 1000 schemas deduping is 1 second ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: iamjustinhsu <jhsu@anyscale.com> Signed-off-by: Alexey Kudinkin <alexey.kudinkin@gmail.com> Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com> Signed-off-by: Matthew Owen <mowen@anyscale.com>
<!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> - unification of schemas is slow - revert back to pre https://github.com/ray-project/ray/pull/53454/files commit. if no unification before, no unification after. if unification before, we can leave it there or add it back. If I removed it I added a comment with `# NOTE` <!-- Please give a short summary of the change and the problem this solves. --> <!-- For example: "Closes ray-project#1234" --> - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: iamjustinhsu <jhsu@anyscale.com> Signed-off-by: Alexey Kudinkin <alexey.kudinkin@gmail.com> Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com> Signed-off-by: Matthew Owen <mowen@anyscale.com>
47761ab to
eb0973d
Compare
|
FYI - In case anyone else saw tons of warnings like this: when using Ray data to load Parquet datasets with schema metadata after upgrading to Ray 2.49.1. It's because PyArrow is sending the schema metadata dict to the hash func directly. |
Why are these changes needed?
Cherry pick two fixes for ray data (from #55854 and #55926).
Related issue number
Checks
git commit -s) in this PR.scripts/format.shto lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.