
Calculate _num_batches_to_skip based on global_rows_processed_this_epoch#55964

Merged
matthewdeng merged 6 commits into ray-project:master from liulehui:release-test
Aug 28, 2025

Conversation

Contributor

@liulehui liulehui commented Aug 26, 2025

Why are these changes needed?

  1. Previously, we used `_restored_train_batch_idx` as run state to determine how many batches to skip when resuming training.
  2. This PR introduces `_global_rows_processed_this_epoch` and `_num_batches_to_skip` instead, making it easier to calculate the number of batches to skip when resuming training with a different number of workers.
  3. Added a `checkpoint_every_n_steps: int = -1` config so that validation frequency and checkpoint frequency can be configured independently.
  4. Release test run: https://buildkite.com/ray-project/release/builds/55274
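The resume logic described in point 2 can be sketched as follows. This is a minimal sketch, not the actual runner code: `num_batches_to_skip` and its parameters are hypothetical stand-ins for the runner's internal state.

```python
def num_batches_to_skip(global_rows_processed_this_epoch: int,
                        per_worker_batch_size: int,
                        world_size: int) -> int:
    """Derive how many global batches to skip when resuming mid-epoch.

    Rows processed is independent of the worker count, so dividing by
    the *current* global batch size gives a correct skip count even
    when the run resumes with a different number of workers.
    """
    global_batch_size = per_worker_batch_size * world_size
    return global_rows_processed_this_epoch // global_batch_size
```

For example, after 10,000 rows with a per-worker batch size of 32, resuming with 8 workers skips `10000 // 256 = 39` batches, while resuming with 4 workers skips `10000 // 128 = 78`.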

Related issue number

Checks

  • I've signed off every commit (using the `-s` flag, i.e., `git commit -s`) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the logic for skipping batches during training resumption to be more robust, particularly when the number of workers changes. It achieves this by introducing _global_rows_processed_this_epoch and _num_batches_to_skip to replace the previous _restored_train_batch_idx mechanism. Additionally, it decouples checkpointing from validation by adding a new checkpoint_every_n_steps configuration. The changes are well-structured and improve the flexibility of the benchmark runner. I have identified a couple of areas for improvement in runner.py: one concerning code duplication and another related to a potential logic bug in the new checkpointing flow.

Comment on lines +160 to 163

```python
global_batch_size = (
    self.benchmark_config.dataloader_config.train_batch_size
    * ray.train.get_context().get_world_size()
)
```
Contributor


medium

This global_batch_size calculation is duplicated from the _num_batches_to_skip property (lines 128-131). To improve maintainability and avoid potential inconsistencies, consider extracting this logic into a shared helper property or method.
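The suggested extraction could look like this. A minimal sketch only: `Runner` and its constructor are simplified stand-ins, and `world_size` is passed explicitly here, whereas the benchmark runner reads it from `ray.train.get_context().get_world_size()`.

```python
class Runner:
    def __init__(self, train_batch_size: int, world_size: int):
        self.train_batch_size = train_batch_size
        self.world_size = world_size

    @property
    def global_batch_size(self) -> int:
        # Single source of truth, reused by both _num_batches_to_skip
        # and the training loop instead of duplicating the calculation.
        return self.train_batch_size * self.world_size
```

Both call sites then read `self.global_batch_size`, so a future change to the formula only has to be made in one place.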

@liulehui liulehui force-pushed the release-test branch 2 times, most recently from 7121434 to 4b4155d on August 26, 2025 at 22:34
Signed-off-by: Lehui Liu <lehui@anyscale.com>
@ray-gardener ray-gardener bot added the train Ray Train Related Issue label Aug 27, 2025
@liulehui
Contributor Author

Successful release test run: https://buildkite.com/ray-project/release/builds/55274

@liulehui liulehui requested a review from justinvyu August 27, 2025 16:24
@liulehui liulehui added the go add ONLY when ready to merge, run all tests label Aug 27, 2025
@liulehui liulehui requested a review from matthewdeng August 27, 2025 20:05
@matthewdeng matthewdeng merged commit 0a2bacf into ray-project:master Aug 28, 2025
5 checks passed
tohtana pushed a commit to tohtana/ray that referenced this pull request Aug 29, 2025
…och (ray-project#55964)

1. Previously, we use `_restored_train_batch_idx` as a run_state to
determine how many batches to skip when resuming training.
2. In this PR we introduced a `_global_rows_processed_this_epoch` and
`_num_batches_to_skip` instead so that it will be easier for us to
calculate the num_batches to skip when resuming training with different
number of workers.
3. Added one `checkpoint_every_n_steps: int = -1` config so that we can
separate validation and checkpoint frequency.
4. release test run:
https://buildkite.com/ray-project/release/builds/55274

Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
@liulehui liulehui deleted the release-test branch September 2, 2025 22:09
gangsf pushed a commit to gangsf/ray that referenced this pull request Sep 2, 2025
sampan-s-nayak pushed a commit to sampan-s-nayak/ray that referenced this pull request Sep 8, 2025
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
wyhong3103 pushed a commit to wyhong3103/ray that referenced this pull request Sep 12, 2025
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025

Labels

go add ONLY when ready to merge, run all tests train Ray Train Related Issue

Projects

None yet

Development


3 participants