[train][doc] Add validation and other details to checkpoint doc #57065
matthewdeng merged 24 commits into ray-project:master
Code Review
This pull request significantly enhances the documentation for Ray Train checkpoints by adding details on validation and various upload modes. The new code examples are valuable, but I've identified critical issues in the asynchronous checkpointing examples concerning the lifecycle of temporary directories, which could lead to runtime failures. I've also pointed out a couple of inaccuracies in the documentation text. My feedback includes suggestions to correct these issues to ensure the examples are robust and the documentation is accurate.
c5741d9 to ef44dd8
aslonnie left a comment:
@elliot-barn to review the dependency change.
Signed-off-by: Timothy Seah <tseah@anyscale.com>
9c28ce9 to 636cb68
justinvyu left a comment:
Thanks!
Questions to answer in the user guide:
- How do I set resources on the validation group?
- Emphasize that we can set resources differently on the validation group to leverage a heterogeneous cluster. Ex: show that you can use different GPU types (ex: A100 for training and A10G for inference, which has a lower GPU memory requirement). I think we should also show a best practice of scheduling even CPU tasks onto different nodes to prevent validation from interfering with training.
- What's the lifecycle of "metrics" attached to the checkpoint? Partial metrics -> validation -> Full metrics
- How can I access my validation results at the end of training?
- Followup: How can I report my validation results to wandb?
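As a rough illustration of the metrics lifecycle asked about above (partial metrics, then validation, then full metrics), here is a minimal sketch in plain Python. It does not use Ray Train. `CheckpointRecord`, `validate`, and `report` are hypothetical names invented for this sketch, and the validation result is a hard-coded placeholder, not a real measurement.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class CheckpointRecord:
    """Hypothetical stand-in for a checkpoint plus its attached metrics."""
    path: str
    metrics: dict = field(default_factory=dict)


def validate(record):
    # Placeholder for real validation work (for example, running inference
    # on a held-out dataset). The returned value is made up for illustration.
    return {"val_accuracy": 0.9}


def _validate_and_merge(record):
    # Merge inside the task so the metrics are complete once the future resolves.
    record.metrics.update(validate(record))
    return record


def report(record, train_metrics, executor):
    # Step 1: attach the partial (training-time) metrics immediately.
    record.metrics.update(train_metrics)
    # Step 2: schedule validation in the background; training can continue.
    return executor.submit(_validate_and_merge, record)


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=1) as pool:
        rec = CheckpointRecord(path="ckpt_0")
        fut = report(rec, {"loss": 0.5}, pool)
        fut.result()  # Step 3: after validation finishes, metrics are full.
    print(rec.metrics)
```

The merge happens inside the submitted task rather than in a done-callback, so waiting on the future is enough to know the full metrics are attached.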
…do to request gpu Signed-off-by: Timothy Seah <tseah@anyscale.com>
Done. A few comments:
.. literalinclude:: ../doc_code/asynchronous_validation.py
    :language: python
    :start-after: __validate_fn_simple_start__
    :end-before: __validate_fn_simple_end__
This example isn't very practical, because we don't allow the user to run this validate_fn as a GPU task, so they can only do CPU inference here. What about keeping this as a skeleton function and pointing the user to the section below on how to write a distributed validate function?
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Signed-off-by: Timothy Seah <tseah@anyscale.com>
Done. Ended up using |
…project#57065)

# Summary

Add validation to checkpoint doc. Also mention that async checkpoint uploading requires you to create a long-lasting checkpoint directory.

# Testing

docbuild

---

> [!NOTE]
> Documents checkpoint upload modes and async validation with new examples and API refs, adds s3torchconnector deps, and updates CI conda install to use the classic solver.
>
> - **Docs (Train)**:
>   - **Checkpoints guide**: Rename to "Saving, Validating, and Loading Checkpoints"; add sections on `CheckpointUploadMode` (sync/async/custom) and async checkpoint validation with `validate_fn`; include new examples and figure.
>   - **API refs**: Add `train.CheckpointUploadMode` to `doc/source/train/api/api.rst`.
>   - **Doc examples**: Add `checkpoints.py` snippets for `CheckpointUploadMode.SYNC/ASYNC/NO_UPLOAD`, Torch/XLA validation via `TorchTrainer`, Ray Data `map_batches`, and reporting with `validate_function`.
>   - **Monitoring & logging**: Note validating checkpoints as a primary reporting use case.
> - **Dependencies**:
>   - Add `s3torchconnector==1.4.3` (and transitively `s3torchconnectorclient==1.4.3`) to `python/requirements_*` for custom S3 uploads.
> - **CI**:
>   - Include `ci/env/install-miniforge.sh` in `ci/docker/base.ml.wanda.yaml` build context.
>   - In `ci/env/install-miniforge.sh`, install `mpi4py` with `conda ... --solver classic` for Python <3.12.
>
> Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit c5741d9. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Summary
Add validation to checkpoint doc. Also mention that async checkpoint uploading requires you to create a long-lasting checkpoint directory.
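To illustrate why an asynchronous upload needs a long-lasting checkpoint directory, here is a minimal sketch in plain Python. It does not use Ray Train: `upload_async` and `save_checkpoint_long_lived` are hypothetical helpers, and a background thread copying files stands in for the real upload. The point is only the lifecycle: a directory created with `tempfile.TemporaryDirectory()` as a context manager is deleted when the `with` block exits, which can happen before a background upload has read it, whereas `tempfile.mkdtemp()` leaves cleanup to the caller.

```python
import os
import shutil
import tempfile
import threading


def upload_async(local_dir, remote_dir, delete_local_when_done=False):
    """Copy local_dir to remote_dir on a background thread (stand-in for an upload)."""
    def _upload():
        shutil.copytree(local_dir, remote_dir, dirs_exist_ok=True)
        if delete_local_when_done:
            # Clean up only after the copy has finished reading the files.
            shutil.rmtree(local_dir, ignore_errors=True)

    thread = threading.Thread(target=_upload)
    thread.start()
    return thread


def save_checkpoint_long_lived(remote_dir, payload):
    # mkdtemp() creates a directory that is NOT removed automatically,
    # so it outlives this function while the upload runs in the background.
    local_dir = tempfile.mkdtemp(prefix="ckpt_")
    with open(os.path.join(local_dir, "model.bin"), "wb") as f:
        f.write(payload)
    thread = upload_async(local_dir, remote_dir, delete_local_when_done=True)
    return thread, local_dir


if __name__ == "__main__":
    remote = os.path.join(tempfile.mkdtemp(), "remote")
    thread, local = save_checkpoint_long_lived(remote, b"weights")
    thread.join()
    print(os.listdir(remote), os.path.exists(local))
```

Deleting the local directory from inside the upload thread is one possible design; any cleanup that runs strictly after the upload completes would serve the same purpose.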
Testing
docbuild
Note
Documents checkpoint upload modes and async validation with new examples and API refs, adds s3torchconnector deps, and updates CI conda install to use the classic solver.
- Docs (Train):
  - Checkpoints guide: Rename to "Saving, Validating, and Loading Checkpoints"; add sections on `CheckpointUploadMode` (sync/async/custom) and async checkpoint validation with `validate_fn`; include new examples and figure.
  - API refs: Add `train.CheckpointUploadMode` to `doc/source/train/api/api.rst`.
  - Doc examples: Add `checkpoints.py` snippets for `CheckpointUploadMode.SYNC/ASYNC/NO_UPLOAD`, Torch/XLA validation via `TorchTrainer`, Ray Data `map_batches`, and reporting with `validate_function`.
  - Monitoring & logging: Note validating checkpoints as a primary reporting use case.
- Dependencies:
  - Add `s3torchconnector==1.4.3` (and transitively `s3torchconnectorclient==1.4.3`) to `python/requirements_*` for custom S3 uploads.
- CI:
  - Include `ci/env/install-miniforge.sh` in `ci/docker/base.ml.wanda.yaml` build context.
  - In `ci/env/install-miniforge.sh`, install `mpi4py` with `conda ... --solver classic` for Python <3.12.

Written by Cursor Bugbot for commit c5741d9.