[train] Add HF Trainer support for dictionary of datasets #56046
wyhong3103 wants to merge 179 commits into ray-project:master
Conversation
Code Review
This pull request adds support for using a dictionary of Ray Datasets for evaluation in the HuggingFace Trainer. The change correctly handles the case where eval_dataset is a string key into a dictionary of datasets. My review focuses on improving the implementation to be more robust and maintainable by addressing a potential KeyError and reducing code duplication.
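The string-key dispatch under review can be sketched in isolation. This is a minimal stand-in, not Ray's actual implementation: the `Dataset` stub and the `TrainerShim` class (and everything inside them) are invented for illustration, mirroring the behavior the review describes.

```python
from typing import Optional, Union


class Dataset:
    """Stand-in for a Ray Dataset (the real class lives in ray.data)."""

    def __init__(self, name: str):
        self.name = name


class TrainerShim:
    """Minimal sketch of the string-key dispatch under review; not Ray's
    actual implementation."""

    def __init__(self, eval_dataset=None):
        # Either a single Dataset or a Dict[str, Dataset].
        self.eval_dataset = eval_dataset

    def get_eval_dataloader(self, eval_dataset: Optional[Union[str, Dataset]] = None):
        # Mirror transformers.Trainer: a string argument selects one entry
        # from the dict passed at construction time.
        if eval_dataset is None:
            eval_dataset = self.eval_dataset
        if isinstance(eval_dataset, str):
            if not isinstance(self.eval_dataset, dict):
                raise ValueError("a string eval_dataset requires a dict of datasets")
            if eval_dataset not in self.eval_dataset:
                # Guard against the bare KeyError the review calls out.
                raise KeyError(f"unknown eval dataset key: {eval_dataset!r}")
            eval_dataset = self.eval_dataset[eval_dataset]
        return eval_dataset  # the real code wraps this for the training loop


shim = TrainerShim(eval_dataset={"val": Dataset("val"), "test": Dataset("test")})
print(shim.get_eval_dataloader("test").name)  # -> test
```

Checking membership before indexing is what turns the potential `KeyError` into an actionable error message.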
Can we fix the lint and add the use case to the PR description?

Done!

Can we also add tests to https://github.com/ray-project/ray/blob/master/python/ray/train/tests/test_torch_transformers_train.py? They can also showcase how the new changes are used. Thanks!

Done!
Force-pushed b3997c7 to ec4ff6d (Compare)
xinyuangui2
left a comment
Let's also mention that this change aligns with the transformer function
https://github.com/huggingface/transformers/blob/v4.56.1/src/transformers/trainer.py#L1196
in the PR description.
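For context, the linked transformers code handles a dict-valued `eval_dataset` by evaluating each entry separately and suffixing the metric prefix with the entry's name. A hedged sketch of that behavior, with `evaluate_all` and `evaluate_one` invented here as stand-ins for `Trainer.evaluate` and its single-dataset evaluation loop:

```python
def evaluate_all(eval_dataset, evaluate_one, metric_key_prefix="eval"):
    """Sketch of how transformers.Trainer.evaluate handles a dict of eval
    datasets: recurse per entry, suffixing the metric prefix with the name.
    evaluate_one is a stand-in for the single-dataset evaluation loop."""
    if isinstance(eval_dataset, dict):
        metrics = {}
        for name, dataset in eval_dataset.items():
            metrics.update(
                evaluate_all(dataset, evaluate_one, f"{metric_key_prefix}_{name}")
            )
        return metrics
    # Prefix the raw metrics from a single evaluation pass.
    return {
        f"{metric_key_prefix}_{key}": value
        for key, value in evaluate_one(eval_dataset).items()
    }


# Toy usage: "evaluation" just reports the dataset length as a loss.
metrics = evaluate_all(
    {"val": [1, 2], "test": [1]}, lambda ds: {"loss": float(len(ds))}
)
print(metrics)  # {'eval_val_loss': 2.0, 'eval_test_loss': 1.0}
```

This is why aligning the Ray-side dataloader with the string-key convention matters: each per-name evaluation call eventually asks for a dataloader for that one entry.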
@@ -131,7 +131,14 @@ def get_eval_dataloader(
    if eval_dataset is None:

nit: Update `eval_dataset: Optional[Dataset]` to `eval_dataset: Optional[Union[str, Dataset]] = None`.
Done!

Let's sign off the PR to bypass the DCO test, and then we are good to go.
reubenvanammers
left a comment
LGTM, pretty much the same as my PR for the same issue - happy to get this in.
Commits
- Signed-off-by: dayshah <dhyey2019@gmail.com> Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
- …6033) Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
- ray-project#55955: Split `GcsPlacementGroup` into its own file to break a circular dependency, and eliminated `gcs_server_test_util.h` by moving the global fakes' implementations into the fakes themselves and the remaining utilities into the single file that depended on them. Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
- `serve.shutdown()` is best effort in async contexts; use `shutdown_async()` instead. Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
- …och (ray-project#55964): Replaced the `_restored_train_batch_idx` run state with `_global_rows_processed_this_epoch` and `_num_batches_to_skip`, making it easier to compute how many batches to skip when resuming training with a different number of workers. Also added a `checkpoint_every_n_steps: int = -1` config to decouple validation and checkpoint frequency. Release test run: https://buildkite.com/ray-project/release/builds/55274 Signed-off-by: Lehui Liu <lehui@anyscale.com>
- Capture and sandbox action env vars properly so that caching works correctly. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
- …e-test-attr-regex-filters (ray-project#56039): Keep `release-test-attr-regex-filters` (the metadata used in the Release pipeline input box) backward compatible after adding `release-test-filters`. Signed-off-by: kevin <kevin@anyscale.com>
- Add the Ray Train maintainers as CODEOWNERS for the `/python/ray/air` sub-directory; without this change the directory falls back to Ray Core as CODEOWNER due to https://github.com/matthewdeng/ray/blob/ca2d866a95385aea4fd3de1c3b9f5148f899c282/.github/CODEOWNERS#L20-L21. Signed-off-by: Matthew Deng <matt@anyscale.com>
- …ct#56268): Follow-up to ray-project#56105. Dataset size estimation is now based on listed file sizes, but the encoding ratio was still derived from uncompressed-size estimates in Parquet metadata. Rebase the encoding ratio to relate estimated in-memory size to the listed file size, and clean up unused abstractions such as `ParquetMetadataProvider`. Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
- …56289): Limit the triggers to Python-related changes rather than all changes. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
- Upgrade boto3. Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
- … Dataset.stats() (ray-project#56040)
- …ct#55851) (ray-project#56094): When Serve file logs are buffered via a `MemoryHandler`, `ServeContextFilter` fetched the request context at flush time, so many records flushed together could share the same context, breaking per-request tracing. Capture the context at emit time when buffering is enabled and make the filter idempotent, preserving per-record context for buffered file logs without changing non-buffered behavior. Locust benchmarks (100 users, 3-4 minutes) show comparable latency and throughput with and without the fix in both buffered and non-buffered modes. Closes ray-project#55851. Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>
- …t#56261) Signed-off-by: Rueian <rueian@anyscale.com>
- Signed-off-by: dayshah <dhyey2019@gmail.com>
Force-pushed 7676d2d to 56c21c5 (Compare)
@xinyuangui2 Noob question: was this supposed to happen? I simply followed the instructions in the earlier DCO test result. Let me know if this is wrong, and I will re-create the PR.

Let's recreate the PR. This one includes a lot of changes by accident.
Use case
To support evaluating a model via the HuggingFace Trainer using a dictionary of Ray Datasets. This aligns with the `eval_dataset` argument type in `get_eval_dataloader` in transformers.

Why are these changes needed?
If users provide a dictionary of Ray Datasets as `eval_dataset` to `HuggingFaceTrainer`, training fails during the evaluation step because the dataloader currently returned is NOT of type `RayTorchIterableDataset`.

Related issue number

Checks
- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
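The failure mode described above can be illustrated with a minimal stand-in. Everything here except the `RayTorchIterableDataset` name (taken from the description) is invented for the sketch; the real wrapper lives inside Ray Train and wraps a Ray Data iterator.

```python
class RayTorchIterableDataset:
    """Stub for the wrapper type the evaluation loop expects; this body is
    invented for illustration, the real class wraps a Ray Data iterator."""

    def __init__(self, dataset):
        self.dataset = dataset


def build_eval_dataloader(eval_dataset, key=None):
    # Sketch of the fix: when a dict of datasets is supplied, resolve the
    # requested entry *before* wrapping, so the object handed to the
    # evaluation step is always a RayTorchIterableDataset, never a raw dict.
    if isinstance(eval_dataset, dict):
        eval_dataset = eval_dataset[key]
    return RayTorchIterableDataset(eval_dataset)


loader = build_eval_dataloader({"val": ["batch-1"], "test": ["batch-2"]}, key="val")
print(type(loader).__name__, loader.dataset)
```

Without the dict-resolution step, the dict itself would be wrapped (or passed through unwrapped), which is exactly the type mismatch the PR description reports.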