
[train] Add hf trainer support for dictionary of datasets#56046

Closed
wyhong3103 wants to merge 179 commits into ray-project:master from wyhong3103:support-hf-dict-of-datasets

Conversation

@wyhong3103
Contributor

@wyhong3103 wyhong3103 commented Aug 28, 2025

Use case

To support evaluating a model via the HuggingFace Trainer using a dictionary of Ray Datasets. This aligns with the eval_dataset argument type of get_eval_dataloader in transformers.

Why are these changes needed?

If users provide a dictionary of Ray Datasets as eval_dataset to the HuggingFace Trainer, it fails during the evaluation step because the dataloader currently returned is NOT of type RayTorchIterableDataset.
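A minimal sketch (plain Python, not the PR's actual code) of the dispatch involved: when eval_dataset is a dict, the transformers Trainer loops over its keys and calls get_eval_dataloader with the *string key*, so the Ray integration must resolve that key back to a dataset before building its dataloader. All names below are illustrative.

```python
# Hedged sketch of the failure mode this PR fixes: get_eval_dataloader may
# receive a string key rather than a dataset object, and must resolve it
# against the dict of eval datasets. Names are illustrative, not Ray's code.

def get_eval_dataloader(eval_dataset, eval_datasets_dict):
    # A string key must be resolved against the dict of datasets first;
    # without this step the raw string falls through and the wrapped Ray
    # dataloader is never returned (the reported failure).
    if isinstance(eval_dataset, str):
        if eval_dataset not in eval_datasets_dict:
            raise KeyError(f"No eval dataset named {eval_dataset!r}")
        eval_dataset = eval_datasets_dict[eval_dataset]
    return list(eval_dataset)  # stand-in for building the real dataloader

datasets = {"validation": [1, 2, 3], "test": [4, 5]}
print(get_eval_dataloader("validation", datasets))  # [1, 2, 3]
```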

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@wyhong3103 wyhong3103 requested a review from a team as a code owner August 28, 2025 13:22
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for using a dictionary of Ray Datasets for evaluation in the HuggingFace Trainer. The change correctly handles the case where eval_dataset is a string key into a dictionary of datasets. My review focuses on improving the implementation to be more robust and maintainable by addressing a potential KeyError and reducing code duplication.

@xinyuangui2 xinyuangui2 self-requested a review August 28, 2025 16:59
@xinyuangui2
Contributor

Can we fix the lint and add the use case in the PR?

@ray-gardener ray-gardener bot added the train (Ray Train related issue), data (Ray Data-related issues), and community-contribution (Contributed by the community) labels Aug 28, 2025
@wyhong3103
Contributor Author

Done!

@xinyuangui2
Contributor

Can we also add tests to

https://github.com/ray-project/ray/blob/master/python/ray/train/tests/test_torch_transformers_train.py
https://github.com/ray-project/ray/blob/master/python/ray/train/v2/tests/test_torch_transformers_train.py

It can also showcase how the new changes are used. Thanks

@gvspraveen gvspraveen removed the data Ray Data-related issues label Sep 2, 2025
@wyhong3103
Contributor Author

Done!

@wyhong3103 wyhong3103 force-pushed the support-hf-dict-of-datasets branch from b3997c7 to ec4ff6d on September 8, 2025 09:08
Contributor

@xinyuangui2 xinyuangui2 left a comment


Let's also mention in the PR description that this change aligns with the transformers function
https://github.com/huggingface/transformers/blob/v4.56.1/src/transformers/trainer.py#L1196

@@ -131,7 +131,14 @@ def get_eval_dataloader(
if eval_dataset is None:
Contributor


nit: Update eval_dataset: Optional[Dataset] to eval_dataset: Optional[Union[str, Dataset]] = None
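The suggested annotation can be sketched as follows; `Dataset` here is a placeholder type and the method body is illustrative, not the repository's actual code. The point is that the parameter may now be a string key, a dataset object, or None.

```python
from typing import Optional, Union

class Dataset:  # placeholder for the real dataset type
    pass

class DataloaderMixin:
    def get_eval_dataloader(
        self, eval_dataset: Optional[Union[str, Dataset]] = None
    ) -> str:
        # A str is treated as a key into the dict of eval datasets;
        # None falls back to the default eval dataset. Here we just
        # report the argument's type to illustrate the union.
        return type(eval_dataset).__name__

print(DataloaderMixin().get_eval_dataloader("validation"))  # str
```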

@wyhong3103
Contributor Author

Done!

@xinyuangui2
Contributor

Let's sign off the PR to bypass DCO test, and then we are good to go.


@reubenvanammers reubenvanammers left a comment


LGTM, pretty much the same as my PR for the same issue - happy to get this in.

dayshah and others added 12 commits September 12, 2025 09:57
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
…6033)

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
ray-project#55955)

Involved splitting `GcsPlacementGroup` into its own file to break a
circular dependency.

Also eliminated `gcs_server_test_util.h` by:

1. Updating the global fakes with the implementation that was defined in
the file.
2. Moving other utilities into the single file that depended on them.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
)

## Why are these changes needed?

`serve.shutdown()` is best effort in async contexts. We should use
`shutdown_async()`.

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
…och (ray-project#55964)

1. Previously, we used `_restored_train_batch_idx` as a run state to
determine how many batches to skip when resuming training.
2. In this PR we introduce `_global_rows_processed_this_epoch` and
`_num_batches_to_skip` instead, so that it is easier to calculate the
number of batches to skip when resuming training with a different
number of workers.
3. Added a `checkpoint_every_n_steps: int = -1` config so that
validation and checkpoint frequency can be set separately.
4. release test run:
https://buildkite.com/ray-project/release/builds/55274
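The resume arithmetic described above can be sketched as follows; the function and variable names are illustrative (not the PR's actual fields), but they show why tracking globally processed rows makes the skip count robust to a changed worker count.

```python
# Hedged sketch: derive the per-worker batches to skip on resume from the
# globally processed row count, so the result stays correct even when the
# number of workers changes across restarts. Names are illustrative.

def num_batches_to_skip(global_rows_processed: int, num_workers: int,
                        per_worker_batch_size: int) -> int:
    # Each global step consumes one batch per worker.
    rows_per_global_step = num_workers * per_worker_batch_size
    return global_rows_processed // rows_per_global_step

# e.g. 1024 rows already processed, resuming with 4 workers x batch size 32:
print(num_batches_to_skip(1024, 4, 32))  # 8
```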

Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
so that action env vars are properly captured and sandboxed, which makes
caching work properly.

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
…e-test-attr-regex-filters (ray-project#56039)

`release-test-attr-regex-filters` is the metadata being used in the
input box for the Release pipeline, which we need to keep backward
compatible after adding `release-test-filters`.

Signed-off-by: kevin <kevin@anyscale.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
Add Ray Train maintainers as the CODEOWNERS for `/python/ray/air`
sub-directory.

Without this change, this directory will have Ray Core assigned as the
CODEOWNER due to [this
line](https://github.com/matthewdeng/ray/blob/ca2d866a95385aea4fd3de1c3b9f5148f899c282/.github/CODEOWNERS#L20-L21).

Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
alexeykudinkin and others added 13 commits September 12, 2025 10:11
…ct#56268)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

This change is a follow-up for
ray-project#56105.

Now dataset size estimation is based on listed file sizes. However, the
encoding ratio was still based on file size estimates derived from
the uncompressed data size obtained from Parquet metadata.

This change addresses that by:

- Rebasing the encoding ratio to relate the estimated in-memory size to
the listed file size
 - Cleaning up unused abstractions (like `ParquetMetadataProvider`)
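The rebased estimate described above can be illustrated with a small sketch; the function name, calibration method, and numbers are assumptions for illustration, not Ray Data's actual code.

```python
# Hedged sketch: the encoding ratio now relates the estimated in-memory size
# to the *listed* (on-disk) file size, instead of to size estimates derived
# from Parquet metadata. All names and numbers are illustrative.

def estimate_in_memory_size(listed_file_bytes: int, encoding_ratio: float) -> int:
    return int(listed_file_bytes * encoding_ratio)

# Ratio calibrated from a sample: in-memory bytes / on-disk bytes.
sampled_in_memory, sampled_on_disk = 40_000_000, 10_000_000
ratio = sampled_in_memory / sampled_on_disk  # 4.0
print(estimate_in_memory_size(25_000_000, ratio))  # 100000000
```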

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
…56289)

so that they are limited to Python-related changes, rather than all
changes.

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
upgrading boto3

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
… Dataset.stats() (ray-project#56040)

Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
…ct#55851) (ray-project#56094)

## Why are these changes needed?

Currently, when Serve file logs are buffered via a `MemoryHandler`,
`ServeContextFilter` fetches the Serve request context at flush time
instead of when the log record is emitted. As a result, many log records
flushed together can share the same request context, breaking per-request
tracing.
This PR captures the request context at emit time when buffering is
enabled and makes the filter idempotent so it won't overwrite
pre-populated fields. This preserves correct per-record context for
buffered file logs without changing non-buffered behavior.
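The emit-time capture described above can be sketched with stdlib logging; `ServeContextFilter`'s internals are assumptions here, and this only illustrates the pattern: stamp the request context when the record is created, and stay idempotent so a later flush-time pass cannot overwrite it.

```python
import logging
from contextvars import ContextVar

# Hypothetical request-context variable; Serve's real context differs.
current_request_id: ContextVar[str] = ContextVar("request_id", default="-")

class EmitTimeContextFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        # Only set the field if it hasn't been populated already (idempotent),
        # so a second filter pass at buffered-flush time is a no-op.
        if not hasattr(record, "request_id"):
            record.request_id = current_request_id.get()
        return True

# Context is set while handling a request, and a record is created...
current_request_id.set("req-42")
record = logging.LogRecord("app", logging.INFO, __file__, 1, "hello", (), None)
EmitTimeContextFilter().filter(record)

# ...then the context changes before a buffered flush; the stamp must not change.
current_request_id.set("req-99")
EmitTimeContextFilter().filter(record)
print(record.request_id)  # req-42
```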
<!-- Please give a short summary of the change and the problem this
solves. -->

## Related issue number
Closes ray-project#55851

## Performance Testing
Manual verification: benchmarked buffered and non-buffered cases, with
and without the fix.
Performance: used Locust with 100 users for a duration of 3-4 minutes.

Without buffering:
With fix: `Avg: 396.69(ms), P99: 580(ms), RPS: 228.4`
Without fix: `Avg: 391.29(ms), P99: 560(ms), RPS: 239`

With buffering:
set `RAY_SERVE_REQUEST_PATH_LOG_BUFFER_SIZE` = 1000
With fix: `Avg(ms): 400.83, P99(ms): 620, RPS: 230.5`
Without fix: `Avg(ms): 373.25, P99(ms): 610, RPS: 249.4`

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [x] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>
Signed-off-by: Vaishnavi Panchavati <38342947+vaishdho1@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Abrar Sheikh <abrar2002as@gmail.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
…t#56261)

Signed-off-by: Rueian <rueian@anyscale.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
@wyhong3103 wyhong3103 force-pushed the support-hf-dict-of-datasets branch from 7676d2d to 56c21c5 on September 12, 2025 02:12
@wyhong3103
Contributor Author

wyhong3103 commented Sep 12, 2025

@xinyuangui2 Noob question: just wondering if this was supposed to happen; I simply followed the instructions in the DCO test result from earlier.

Let me know if this is wrong. I will re-create a PR.

@xinyuangui2
Contributor

@xinyuangui2 Noob question: just wondering if this was supposed to happen; I simply followed the instructions in the DCO test result from earlier.

Let me know if this is wrong. I will re-create a PR.

Let's recreate a PR. This one includes a lot of changes by accident.


Labels

community-contribution (Contributed by the community), train (Ray Train related issue)
