
[train] Add hf trainer support for dictionary of datasets#56046

Closed
wyhong3103 wants to merge 179 commits into ray-project:master from wyhong3103:support-hf-dict-of-datasets

Conversation

@wyhong3103
Contributor

@wyhong3103 wyhong3103 commented Aug 28, 2025

Use case

To support evaluating a model via the HuggingFace Trainer using a dictionary of Ray Datasets. This aligns with the eval_dataset argument type of get_eval_dataloader in transformers.

Why are these changes needed?

If users provide a dictionary of Ray Datasets as eval_dataset to the HuggingFace Trainer, it fails during the evaluation step because the dataloader currently returned is NOT of type RayTorchIterableDataset.
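A minimal sketch (plain Python, not the PR's actual code) of the dispatch involved: when eval_dataset is a dict, the transformers Trainer loops over its keys and calls get_eval_dataloader with the *string key*, so the Ray integration must resolve that key back to a dataset before building its dataloader. All names below are illustrative.

```python
# Hedged sketch of the failure mode this PR fixes: get_eval_dataloader may
# receive a string key rather than a dataset object, and must resolve it
# against the dict of eval datasets. Names are illustrative, not Ray's code.

def get_eval_dataloader(eval_dataset, eval_datasets_dict):
    # A string key must be resolved against the dict of datasets first;
    # without this step the raw string falls through and the wrapped Ray
    # dataloader is never returned (the reported failure).
    if isinstance(eval_dataset, str):
        if eval_dataset not in eval_datasets_dict:
            raise KeyError(f"No eval dataset named {eval_dataset!r}")
        eval_dataset = eval_datasets_dict[eval_dataset]
    return list(eval_dataset)  # stand-in for building the real dataloader

datasets = {"validation": [1, 2, 3], "test": [4, 5]}
print(get_eval_dataloader("validation", datasets))  # [1, 2, 3]
```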

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@wyhong3103 wyhong3103 requested a review from a team as a code owner August 28, 2025 13:22
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for using a dictionary of Ray Datasets for evaluation in the HuggingFace Trainer. The change correctly handles the case where eval_dataset is a string key into a dictionary of datasets. My review focuses on improving the implementation to be more robust and maintainable by addressing a potential KeyError and reducing code duplication.

@xinyuangui2 xinyuangui2 self-requested a review August 28, 2025 16:59
@xinyuangui2
Contributor

Can we fix the lint and add the use case in the PR?

@ray-gardener ray-gardener bot added the train (Ray Train related issue), data (Ray Data-related issues), and community-contribution (Contributed by the community) labels Aug 28, 2025
@wyhong3103
Contributor Author

Done!

@xinyuangui2
Contributor

Can we also add tests to

https://github.com/ray-project/ray/blob/master/python/ray/train/tests/test_torch_transformers_train.py
https://github.com/ray-project/ray/blob/master/python/ray/train/v2/tests/test_torch_transformers_train.py

It can also showcase how the new changes are used. Thanks

@gvspraveen gvspraveen removed the data Ray Data-related issues label Sep 2, 2025
@wyhong3103
Contributor Author

Done!

@wyhong3103 wyhong3103 force-pushed the support-hf-dict-of-datasets branch from b3997c7 to ec4ff6d on September 8, 2025 09:08
Contributor

@xinyuangui2 xinyuangui2 left a comment


Let's also mention in the PR description that this change aligns with the transformers function
https://github.com/huggingface/transformers/blob/v4.56.1/src/transformers/trainer.py#L1196

@@ -131,7 +131,14 @@ def get_eval_dataloader(
if eval_dataset is None:
Contributor


nit: Update eval_dataset: Optional[Dataset] to eval_dataset: Optional[Union[str, Dataset]] = None
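The suggested annotation can be sketched as follows; `Dataset` here is a placeholder type and the method body is illustrative, not the repository's actual code. The point is that the parameter may now be a string key, a dataset object, or None.

```python
from typing import Optional, Union

class Dataset:  # placeholder for the real dataset type
    pass

class DataloaderMixin:
    def get_eval_dataloader(
        self, eval_dataset: Optional[Union[str, Dataset]] = None
    ) -> str:
        # A str is treated as a key into the dict of eval datasets;
        # None falls back to the default eval dataset. Here we just
        # report the argument's type to illustrate the union.
        return type(eval_dataset).__name__

print(DataloaderMixin().get_eval_dataloader("validation"))  # str
```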

@wyhong3103
Contributor Author

Done!

@xinyuangui2
Contributor

Let's sign off the PR to bypass DCO test, and then we are good to go.


@reubenvanammers reubenvanammers left a comment


LGTM, pretty much the same as my PR for the same issue - happy to get this in.

dayshah and others added 12 commits September 12, 2025 09:57
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
…6033)

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
ray-project#55955)

Involved splitting `GcsPlacementGroup` into its own file to break a
circular dependency.

Also eliminated `gcs_server_test_util.h` by:

1. Updating the global fakes with the implementation that was defined in
the file.
2. Moving other utilities into the single file that depended on them.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
)

## Why are these changes needed?

`serve.shutdown()` is best effort in async contexts. We should use
`shutdown_async()`.

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
…och (ray-project#55964)

1. Previously, we used `_restored_train_batch_idx` as a run state to
determine how many batches to skip when resuming training.
2. In this PR we introduce `_global_rows_processed_this_epoch` and
`_num_batches_to_skip` instead, so that it is easier to calculate the
number of batches to skip when resuming training with a different
number of workers.
3. Added a `checkpoint_every_n_steps: int = -1` config so that
validation and checkpoint frequency can be set separately.
4. release test run:
https://buildkite.com/ray-project/release/builds/55274
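The resume arithmetic described above can be sketched as follows; the function and variable names are illustrative (not the PR's actual fields), but they show why tracking globally processed rows makes the skip count robust to a changed worker count.

```python
# Hedged sketch: derive the per-worker batches to skip on resume from the
# globally processed row count, so the result stays correct even when the
# number of workers changes across restarts. Names are illustrative.

def num_batches_to_skip(global_rows_processed: int, num_workers: int,
                        per_worker_batch_size: int) -> int:
    # Each global step consumes one batch per worker.
    rows_per_global_step = num_workers * per_worker_batch_size
    return global_rows_processed // rows_per_global_step

# e.g. 1024 rows already processed, resuming with 4 workers x batch size 32:
print(num_batches_to_skip(1024, 4, 32))  # 8
```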

Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
so that action env vars are properly captured and sandboxed, which makes
caching work properly.

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
…e-test-attr-regex-filters (ray-project#56039)

`release-test-attr-regex-filters` is the metadata being used in the
input box for the Release pipeline, which we need to keep backward
compatible after adding `release-test-filters`.

Signed-off-by: kevin <kevin@anyscale.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
Add Ray Train maintainers as the CODEOWNERS for `/python/ray/air`
sub-directory.

Without this change, this directory will have Ray Core assigned as the
CODEOWNER due to [this
line](https://github.com/matthewdeng/ray/blob/ca2d866a95385aea4fd3de1c3b9f5148f899c282/.github/CODEOWNERS#L20-L21).

Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
alexeykudinkin and others added 13 commits September 12, 2025 10:11
…ct#56268)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

This change is a follow-up for
ray-project#56105.

Now dataset size estimation is based on listed file sizes. However, the
encoding ratio was still based on file size estimates derived from
the uncompressed data size obtained from Parquet metadata.

This change addresses that by:

- Rebasing the encoding ratio to relate the estimated in-memory size to
the listed file size
 - Cleaning up unused abstractions (like `ParquetMetadataProvider`)
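The rebased estimate described above can be illustrated with a small sketch; the function name, calibration method, and numbers are assumptions for illustration, not Ray Data's actual code.

```python
# Hedged sketch: the encoding ratio now relates the estimated in-memory size
# to the *listed* (on-disk) file size, instead of to size estimates derived
# from Parquet metadata. All names and numbers are illustrative.

def estimate_in_memory_size(listed_file_bytes: int, encoding_ratio: float) -> int:
    return int(listed_file_bytes * encoding_ratio)

# Ratio calibrated from a sample: in-memory bytes / on-disk bytes.
sampled_in_memory, sampled_on_disk = 40_000_000, 10_000_000
ratio = sampled_in_memory / sampled_on_disk  # 4.0
print(estimate_in_memory_size(25_000_000, ratio))  # 100000000
```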

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
…56289)

so that they are limited to Python-related changes, rather than all
changes.

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
upgrading boto3

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
… Dataset.stats() (ray-project#56040)

Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
…ct#55851) (ray-project#56094)

## Why are these changes needed?

Currently, when Serve file logs are buffered via a `MemoryHandler`,
`ServeContextFilter` fetches the Serve request context at flush time
instead of when the log record is emitted. As a result, many log records
flushed together can share the same request context, breaking per-request
tracing.
This PR captures the request context at emit time when buffering is
enabled and makes the filter idempotent so it won't overwrite
pre-populated fields. This preserves correct per-record context for
buffered file logs without changing non-buffered behavior.
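The emit-time capture described above can be sketched with stdlib logging; `ServeContextFilter`'s internals are assumptions here, and this only illustrates the pattern: stamp the request context when the record is created, and stay idempotent so a later flush-time pass cannot overwrite it.

```python
import logging
from contextvars import ContextVar

# Hypothetical request-context variable; Serve's real context differs.
current_request_id: ContextVar[str] = ContextVar("request_id", default="-")

class EmitTimeContextFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        # Only set the field if it hasn't been populated already (idempotent),
        # so a second filter pass at buffered-flush time is a no-op.
        if not hasattr(record, "request_id"):
            record.request_id = current_request_id.get()
        return True

# Context is set while handling a request, and a record is created...
current_request_id.set("req-42")
record = logging.LogRecord("app", logging.INFO, __file__, 1, "hello", (), None)
EmitTimeContextFilter().filter(record)

# ...then the context changes before a buffered flush; the stamp must not change.
current_request_id.set("req-99")
EmitTimeContextFilter().filter(record)
print(record.request_id)  # req-42
```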
<!-- Please give a short summary of the change and the problem this
solves. -->

## Related issue number
Closes ray-project#55851

## Performance Testing
Manual verification: benchmarked buffered and non-buffered cases, with
and without the fix.
Performance: used Locust with 100 users for a duration of 3-4 minutes.

Without buffering:
With fix: `Avg: 396.69(ms), P99: 580(ms), RPS: 228.4`
Without fix: `Avg: 391.29(ms), P99: 560(ms), RPS: 239`

With buffering:
set `RAY_SERVE_REQUEST_PATH_LOG_BUFFER_SIZE` = 1000
With fix: `Avg(ms): 400.83, P99(ms): 620, RPS: 230.5`
Without fix: `Avg(ms): 373.25, P99(ms): 610, RPS: 249.4`

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [x] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>
Signed-off-by: Vaishnavi Panchavati <38342947+vaishdho1@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Abrar Sheikh <abrar2002as@gmail.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
…t#56261)

Signed-off-by: Rueian <rueian@anyscale.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
@wyhong3103 wyhong3103 force-pushed the support-hf-dict-of-datasets branch from 7676d2d to 56c21c5 on September 12, 2025 02:12
@wyhong3103
Contributor Author

wyhong3103 commented Sep 12, 2025

@xinyuangui2 Noob question: just wondering if this was supposed to happen; I simply followed the instructions in the DCO test result from earlier.

Let me know if this is wrong. I will re-create a PR.

@xinyuangui2
Contributor

@xinyuangui2 Noob question: just wondering if this was supposed to happen; I simply followed the instructions in the DCO test result from earlier.

Let me know if this is wrong. I will re-create a PR.

Let's recreate a PR. This one includes a lot of changes by accident.


Labels

community-contribution (Contributed by the community), train (Ray Train related issue)
