[train] Cleanups for training ingest benchmark #53684
justinvyu merged 26 commits into ray-project:master
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
This reverts commit fd8e8a5. Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Pull Request Overview
This PR refactors the training ingest benchmark by introducing task-level configurations, consolidating and deduplicating image classification factories, and reorganizing where dataloader settings live.
- Add TaskConfig and ImageClassificationConfig to centralize per-task settings and remove per-variant tasks.
- Move batch-size and row-limit fields into DataLoaderConfig subclasses and update factories to call get_dataloader_config().
- Deduplicate image-classification factories (JPEG/Parquet) under a single ImageClassificationFactory using injected data_dirs.
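As a rough illustration of the first two bullets, the following is a minimal sketch of how task-level and dataloader-level settings might be separated. Only the names TaskConfig, ImageClassificationConfig, DataLoaderConfig, and get_dataloader_config() come from this PR; every field name, default value, the BenchmarkConfig wrapper, and ExampleFactory are assumptions made for illustration and may not match the actual code.

```python
from dataclasses import dataclass, field


@dataclass
class DataLoaderConfig:
    # Assumed fields: the PR only says batch-size and row-limit
    # settings now live on DataLoaderConfig (accessed as limit_*).
    train_batch_size: int = 32
    validation_batch_size: int = 256
    limit_training_rows: int = 1_000_000
    limit_validation_rows: int = 50_000


@dataclass
class TaskConfig:
    """Base class for per-task settings."""


@dataclass
class ImageClassificationConfig(TaskConfig):
    # Assumed knobs that would replace one task definition per variant.
    image_format: str = "jpeg"  # e.g. "jpeg" or "parquet"
    local_fs: bool = False      # read from local disk instead of cloud storage


@dataclass
class BenchmarkConfig:
    # Hypothetical top-level container tying the two configs together.
    task_config: TaskConfig = field(default_factory=ImageClassificationConfig)
    dataloader_config: DataLoaderConfig = field(default_factory=DataLoaderConfig)


class ExampleFactory:
    """Hypothetical factory that reads limits through get_dataloader_config()."""

    def __init__(self, benchmark_config: BenchmarkConfig):
        self.benchmark_config = benchmark_config

    def get_dataloader_config(self) -> DataLoaderConfig:
        return self.benchmark_config.dataloader_config


config = BenchmarkConfig(
    task_config=ImageClassificationConfig(image_format="parquet", local_fs=True)
)
factory = ExampleFactory(config)
train_limit = factory.get_dataloader_config().limit_training_rows
```

With a layout like this, a new variant of the image classification task becomes a different ImageClassificationConfig value rather than a new task name.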
Reviewed Changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| release/train_tests/benchmark/runner.py | Pass dataset_creation_time into get_metrics and update imports |
| release/train_tests/benchmark/recsys/recsys_factory.py | Fix import path for BenchmarkFactory |
| release/train_tests/benchmark/ray_dataloader_factory.py | Add abstract get_ray_datasets and get_ray_data_config |
| release/train_tests/benchmark/image_classification/parquet/factory.py | Inject data_dirs and use get_dataloader_config().limit_* |
| release/train_tests/benchmark/image_classification/localfs_image_classification_jpeg/factory.py | Remove obsolete localfs‐JPEG factory |
| release/train_tests/benchmark/image_classification/localfs_image_classification_jpeg/__init__.py | Remove empty module docstring |
| release/train_tests/benchmark/image_classification/jpeg/factory.py | Inject data_dirs, remove hardcoded dirs, and use limits |
| release/train_tests/benchmark/image_classification/imagenet.py | Add IMAGENET_LOCALFS_SPLIT_DIRS and import DatasetKey |
| release/train_tests/benchmark/image_classification/factory.py | New unified ImageClassificationFactory and helper get_imagenet_data_dirs |
| release/train_tests/benchmark/dataloader_factory.py | Remove unused stub methods |
| release/train_tests/benchmark/config.py | Add TaskConfig types and move row‐limit fields to DataLoaderConfig |
| release/train_tests/benchmark/benchmark_factory.py | Remove deprecated dataset methods |
| release/release_tests.yaml | Update test scripts for new task and flag names |
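The factory deduplication described in the table can be pictured roughly as follows. This is a sketch under stated assumptions, not the code in the PR: the names ImageClassificationFactory, get_imagenet_data_dirs, IMAGENET_LOCALFS_SPLIT_DIRS, and DatasetKey appear in the file summary, but their signatures, the enum members, the placeholder paths, IMAGENET_REMOTE_SPLIT_DIRS, and the two per-format classes are all hypothetical.

```python
from enum import Enum
from typing import Dict


class DatasetKey(str, Enum):
    # Assumed members; the real enum may differ.
    TRAIN = "train"
    VALID = "val"


# Placeholder paths only; the real directories are not shown in this PR page.
IMAGENET_REMOTE_SPLIT_DIRS = {
    DatasetKey.TRAIN: "/placeholder/remote/train",
    DatasetKey.VALID: "/placeholder/remote/val",
}
IMAGENET_LOCALFS_SPLIT_DIRS = {
    DatasetKey.TRAIN: "/placeholder/localfs/train",
    DatasetKey.VALID: "/placeholder/localfs/val",
}


def get_imagenet_data_dirs(local_fs: bool) -> Dict[DatasetKey, str]:
    """Pick the split directories once, instead of hardcoding them per factory."""
    return IMAGENET_LOCALFS_SPLIT_DIRS if local_fs else IMAGENET_REMOTE_SPLIT_DIRS


class JpegFactory:
    """Hypothetical per-format factory that receives its directories."""

    def __init__(self, data_dirs: Dict[DatasetKey, str]):
        self.data_dirs = data_dirs


class ParquetFactory:
    """Hypothetical per-format factory that receives its directories."""

    def __init__(self, data_dirs: Dict[DatasetKey, str]):
        self.data_dirs = data_dirs


class ImageClassificationFactory:
    """One entry point that selects the format and injects data_dirs."""

    _FORMATS = {"jpeg": JpegFactory, "parquet": ParquetFactory}

    def __init__(self, image_format: str, local_fs: bool):
        data_dirs = get_imagenet_data_dirs(local_fs)
        self.delegate = self._FORMATS[image_format](data_dirs)


local_jpeg = ImageClassificationFactory("jpeg", local_fs=True)
remote_parquet = ImageClassificationFactory("parquet", local_fs=False)
```

Because the storage location is injected rather than baked into each factory, a separate localfs-only JPEG factory is no longer needed.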
…hmark_minimal_cleanup
srinathk10 left a comment
Nice. Thanks for the restructure.
    datasets = {}
    data_config = None

    factory.set_dataset_creation_time(time.perf_counter() - start_time)
I think the dataset creation time previously did not capture the actual Ray Dataset construction (in get_ray_datasets).
I updated it to capture the range. Just want to double check that this is accurate.
Ah ok. Good catch! Hope that get_ray_datasets call is negligible (sub-second).
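To make the timing question in this thread concrete, here is a minimal sketch of a measurement window that includes dataset construction. Only get_ray_datasets, set_dataset_creation_time, and the time.perf_counter() pattern come from the diff above; StubFactory, its sleep, and build_datasets are stand-ins invented for illustration and do not reflect the real runner.

```python
import time


class StubFactory:
    """Stand-in for a benchmark factory; not the real class."""

    def __init__(self):
        self.dataset_creation_time = None

    def get_ray_datasets(self):
        # Simulate a small construction cost in place of real Ray Dataset creation.
        time.sleep(0.01)
        return {"train": object(), "val": object()}

    def set_dataset_creation_time(self, seconds: float) -> None:
        self.dataset_creation_time = seconds


def build_datasets(factory):
    # Start the clock before construction so the measured range covers it.
    start_time = time.perf_counter()
    datasets = factory.get_ray_datasets()
    factory.set_dataset_creation_time(time.perf_counter() - start_time)
    return datasets


stub = StubFactory()
stub_datasets = build_datasets(stub)
```

If the call to set_dataset_creation_time were placed before get_ray_datasets, the recorded value would exclude construction, which is the gap the comment above describes.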
This reverts commit 1957ce2.
This PR does some cleanup for the training benchmark:
* Introduces task-level configs so that we don't need to create a new task per variant of the image classification task.
* Moves some configuration settings to logical places (ex: grouping all Ray Data configs in one place).
* Deduplicates some of the redundant "benchmark factories" that were created for the image classification data format / data storage variants.
* Misc. file/directory renames for conciseness.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Summary
This PR does some cleanup for the training benchmark: