Calculate _num_batches_to_skip based on global_rows_processed_this_epoch #55964

matthewdeng merged 6 commits into ray-project:master
Conversation
Code Review
This pull request refactors the logic for skipping batches during training resumption to be more robust, particularly when the number of workers changes. It achieves this by introducing _global_rows_processed_this_epoch and _num_batches_to_skip to replace the previous _restored_train_batch_idx mechanism. Additionally, it decouples checkpointing from validation by adding a new checkpoint_every_n_steps configuration. The changes are well-structured and improve the flexibility of the benchmark runner. I have identified a couple of areas for improvement in runner.py: one concerning code duplication and another related to a potential logic bug in the new checkpointing flow.
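To make the decoupling concrete, here is a minimal sketch of how a `checkpoint_every_n_steps: int = -1` config could separate checkpoint frequency from validation frequency. Only the config name comes from the PR; `validate_every_n_steps`, `RunnerConfig`, and `should_checkpoint` are hypothetical names for illustration, and the actual runner logic may differ.

```python
from dataclasses import dataclass


@dataclass
class RunnerConfig:
    # Hypothetical existing knob controlling validation cadence.
    validate_every_n_steps: int = 100
    # New knob from the PR; -1 means "fall back to the validation cadence".
    checkpoint_every_n_steps: int = -1


def should_checkpoint(step: int, cfg: RunnerConfig) -> bool:
    """Checkpoint on its own cadence when configured, else alongside validation."""
    every_n = (
        cfg.checkpoint_every_n_steps
        if cfg.checkpoint_every_n_steps > 0
        else cfg.validate_every_n_steps
    )
    return step > 0 and step % every_n == 0


cfg = RunnerConfig(checkpoint_every_n_steps=50)
print([s for s in range(1, 201) if should_checkpoint(s, cfg)])  # [50, 100, 150, 200]
```

With the default `-1`, checkpoints keep firing at the validation frequency, so existing configs behave as before; setting a positive value lets checkpointing run more (or less) often than validation.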
```python
global_batch_size = (
    self.benchmark_config.dataloader_config.train_batch_size
    * ray.train.get_context().get_world_size()
)
```
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Successful release test run: https://buildkite.com/ray-project/release/builds/55274
Calculate _num_batches_to_skip based on global_rows_processed_this_epoch (ray-project#55964)

1. Previously, we used `_restored_train_batch_idx` as run state to determine how many batches to skip when resuming training.
2. This PR introduces `_global_rows_processed_this_epoch` and `_num_batches_to_skip` instead, making it easier to calculate the number of batches to skip when resuming training with a different number of workers.
3. Added a `checkpoint_every_n_steps: int = -1` config so that validation and checkpoint frequency can be set separately.
4. Release test run: https://buildkite.com/ray-project/release/builds/55274

Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Why are these changes needed?
1. Previously, we used `_restored_train_batch_idx` as run state to determine how many batches to skip when resuming training.
2. This PR introduces `_global_rows_processed_this_epoch` and `_num_batches_to_skip` instead, making it easier to calculate the number of batches to skip when resuming training with a different number of workers.
3. Added a `checkpoint_every_n_steps: int = -1` config so that validation and checkpoint frequency can be configured separately.

Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.