Calculate _num_batches_to_skip based on global_rows_processed_this_epoch #55964

matthewdeng merged 6 commits into ray-project:master
Conversation
Code Review
This pull request refactors the logic for skipping batches during training resumption to be more robust, particularly when the number of workers changes. It achieves this by introducing _global_rows_processed_this_epoch and _num_batches_to_skip to replace the previous _restored_train_batch_idx mechanism. Additionally, it decouples checkpointing from validation by adding a new checkpoint_every_n_steps configuration. The changes are well-structured and improve the flexibility of the benchmark runner. I have identified a couple of areas for improvement in runner.py: one concerning code duplication and another related to a potential logic bug in the new checkpointing flow.
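To make the decoupling concrete, here is a minimal sketch of how a `checkpoint_every_n_steps: int = -1` config could separate checkpoint frequency from validation frequency. Only the config name comes from the PR; `validate_every_n_steps`, `RunnerConfig`, and `should_checkpoint` are hypothetical names for illustration, and the actual runner logic may differ.

```python
from dataclasses import dataclass


@dataclass
class RunnerConfig:
    # Hypothetical existing knob controlling validation cadence.
    validate_every_n_steps: int = 100
    # New knob from the PR; -1 means "fall back to the validation cadence".
    checkpoint_every_n_steps: int = -1


def should_checkpoint(step: int, cfg: RunnerConfig) -> bool:
    """Checkpoint on its own cadence when configured, else alongside validation."""
    every_n = (
        cfg.checkpoint_every_n_steps
        if cfg.checkpoint_every_n_steps > 0
        else cfg.validate_every_n_steps
    )
    return step > 0 and step % every_n == 0


cfg = RunnerConfig(checkpoint_every_n_steps=50)
print([s for s in range(1, 201) if should_checkpoint(s, cfg)])  # [50, 100, 150, 200]
```

With the default `-1`, checkpoints keep firing at the validation frequency, so existing configs behave as before; setting a positive value lets checkpointing run more (or less) often than validation.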
```python
global_batch_size = (
    self.benchmark_config.dataloader_config.train_batch_size
    * ray.train.get_context().get_world_size()
)
```
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Successful release test run: https://buildkite.com/ray-project/release/builds/55274
Calculate _num_batches_to_skip based on global_rows_processed_this_epoch (ray-project#55964)

1. Previously, we used `_restored_train_batch_idx` as run state to determine how many batches to skip when resuming training.
2. This PR introduces `_global_rows_processed_this_epoch` and `_num_batches_to_skip` instead, making it easier to calculate the number of batches to skip when resuming training with a different number of workers.
3. Added a `checkpoint_every_n_steps: int = -1` config so that validation and checkpoint frequency can be set separately.
4. Release test run: https://buildkite.com/ray-project/release/builds/55274

Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Why are these changes needed?
1. Previously, we used `_restored_train_batch_idx` as run state to determine how many batches to skip when resuming training.
2. This PR introduces `_global_rows_processed_this_epoch` and `_num_batches_to_skip` instead, making it easier to calculate the number of batches to skip when resuming training with a different number of workers.
3. Added a `checkpoint_every_n_steps: int = -1` config so that validation and checkpoint frequency can be configured separately.

Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.