Skip to content

[train] Cleanups for training ingest benchmark#53684

Merged
justinvyu merged 26 commits intoray-project:masterfrom
justinvyu:benchmark_minimal_cleanup
Jun 19, 2025
Merged

[train] Cleanups for training ingest benchmark#53684
justinvyu merged 26 commits intoray-project:masterfrom
justinvyu:benchmark_minimal_cleanup

Conversation

@justinvyu
Copy link
Copy Markdown
Contributor

Summary

This PR does some cleanup for the training benchmark:

  • Introduces task level configs so that we don't need to create a new task per variant of the image classification task.
  • Moves some configuration setting to logical places (ex: grouping all Ray Data configs in one place).
  • Deduplicates some of the redundant "benchmark factories" that were created for the image classification data format / data storage variants.
  • Misc. file/directory renames for conciseness.

justinvyu added 24 commits June 9, 2025 16:20
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
This reverts commit fd8e8a5.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu justinvyu requested review from Copilot and srinathk10 and removed request for Copilot June 10, 2025 01:08
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the training ingest benchmark by introducing task-level configurations, consolidating and deduplicating image classification factories, and reorganizing where dataloader settings live.

  • Add TaskConfig and ImageClassificationConfig to centralize per-task settings and remove per-variant tasks.
  • Move batch‐size and row‐limit fields into DataLoaderConfig subclasses and update factories to call get_dataloader_config().
  • Deduplicate image‐classification factories (JPEG/Parquet) under a single ImageClassificationFactory using injected data_dirs.

Reviewed Changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
release/train_tests/benchmark/runner.py Pass dataset_creation_time into get_metrics and update imports
release/train_tests/benchmark/recsys/recsys_factory.py Fix import path for BenchmarkFactory
release/train_tests/benchmark/ray_dataloader_factory.py Add abstract get_ray_datasets and get_ray_data_config
release/train_tests/benchmark/image_classification/parquet/factory.py Inject data_dirs and use get_dataloader_config().limit_*
release/train_tests/benchmark/image_classification/localfs_image_classification_jpeg/factory.py Remove obsolete localfs‐JPEG factory
release/train_tests/benchmark/image_classification/localfs_image_classification_jpeg/init.py Remove empty module docstring
release/train_tests/benchmark/image_classification/jpeg/factory.py Inject data_dirs, remove hardcoded dirs, and use limits
release/train_tests/benchmark/image_classification/imagenet.py Add IMAGENET_LOCALFS_SPLIT_DIRS and import DatasetKey
release/train_tests/benchmark/image_classification/factory.py New unified ImageClassificationFactory and helper get_imagenet_data_dirs
release/train_tests/benchmark/dataloader_factory.py Remove unused stub methods
release/train_tests/benchmark/config.py Add TaskConfig types and move row‐limit fields to DataLoaderConfig
release/train_tests/benchmark/benchmark_factory.py Remove deprecated dataset methods
release/release_tests.yaml Update test scripts for new task and flag names

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Copy link
Copy Markdown
Contributor

@srinathk10 srinathk10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. Thanks for the restructure.

datasets = {}
data_config = None

factory.set_dataset_creation_time(time.perf_counter() - start_time)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the dataset creation time previously did not capture the actual Ray Dataset construction (in get_ray_datasets).

I updated it to capture the range. Just want to double check that this is accurate.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok. Good catch! Hope that get_ray_datasets call is negligible (sub-second).

@justinvyu justinvyu enabled auto-merge (squash) June 18, 2025 23:22
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Jun 18, 2025
@justinvyu justinvyu merged commit 1957ce2 into ray-project:master Jun 19, 2025
7 checks passed
@justinvyu justinvyu deleted the benchmark_minimal_cleanup branch June 19, 2025 00:26
srinathk10 added a commit that referenced this pull request Jun 20, 2025
minerharry pushed a commit to minerharry/ray that referenced this pull request Jun 27, 2025
This PR does some cleanup for the training benchmark:
* Introduces task level configs so that we don't need to create a new
task per variant of the image classification task.
* Moves some configuration setting to logical places (ex: grouping all
Ray Data configs in one place).
* Deduplicates some of the redundant "benchmark factories" that were
created for the image classification data format / data storage
variants.
* Misc. file/directory renames for conciseness.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025
This PR does some cleanup for the training benchmark:
* Introduces task level configs so that we don't need to create a new
task per variant of the image classification task.
* Moves some configuration setting to logical places (ex: grouping all
Ray Data configs in one place).
* Deduplicates some of the redundant "benchmark factories" that were
created for the image classification data format / data storage
variants.
* Misc. file/directory renames for conciseness.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants