
[train][doc] Add validation and other details to checkpoint doc #57065

Merged: matthewdeng merged 24 commits into ray-project:master from TimothySeah:tseah/validating-checkpoints on Oct 16, 2025
@TimothySeah TimothySeah commented Oct 1, 2025

Summary

Add validation to checkpoint doc. Also mention that async checkpoint uploading requires you to create a long-lasting checkpoint directory.

Testing

docbuild


Note

Documents checkpoint upload modes and async validation with new examples and API refs, adds s3torchconnector deps, and updates CI conda install to use the classic solver.

  • Docs (Train):
    • Checkpoints guide: Rename to "Saving, Validating, and Loading Checkpoints"; add sections on CheckpointUploadMode (sync/async/custom) and async checkpoint validation with validate_fn; include new examples and figure.
    • API refs: Add train.CheckpointUploadMode to doc/source/train/api/api.rst.
    • Doc examples: Add checkpoints.py snippets for CheckpointUploadMode.SYNC/ASYNC/NO_UPLOAD, Torch/XLA validation via TorchTrainer, Ray Data map_batches, and reporting with validate_function.
    • Monitoring & logging: Note validating checkpoints as a primary reporting use case.
  • Dependencies:
    • Add s3torchconnector==1.4.3 (and transitively s3torchconnectorclient==1.4.3) to python/requirements_* for custom S3 uploads.
  • CI:
    • Include ci/env/install-miniforge.sh in ci/docker/base.ml.wanda.yaml build context.
    • In ci/env/install-miniforge.sh, install mpi4py with conda ... --solver classic for Python <3.12.

Written by Cursor Bugbot for commit c5741d9.

@TimothySeah TimothySeah requested review from a team as code owners October 1, 2025 00:01
cursor[bot]

This comment was marked as outdated.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request significantly enhances the documentation for Ray Train checkpoints by adding details on validation and various upload modes. The new code examples are valuable, but I've identified critical issues in the asynchronous checkpointing examples concerning the lifecycle of temporary directories, which could lead to runtime failures. I've also pointed out a couple of inaccuracies in the documentation text. My feedback includes suggestions to correct these issues to ensure the examples are robust and the documentation is accurate.
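The temporary-directory lifecycle issue raised above can be illustrated without Ray at all. The following is a minimal sketch using only the Python standard library; `slow_upload` and `report_async` are hypothetical stand-ins for a background uploader and are not Ray Train APIs. It shows why a directory created with `tempfile.TemporaryDirectory()` can disappear before an asynchronous upload reads it, and how a longer-lived directory avoids that.

```python
import shutil
import tempfile
import threading
import time
from pathlib import Path


def slow_upload(local_dir: str, remote_dir: str, delay: float = 0.2) -> None:
    """Stand-in for an async uploader: waits, then copies the directory."""
    time.sleep(delay)  # the caller has usually moved on by the time this runs
    shutil.copytree(local_dir, remote_dir)


def report_async(local_dir: str, remote_dir: str, errors: list) -> threading.Thread:
    """Hypothetical helper: start the upload in the background and return at once."""

    def run() -> None:
        try:
            slow_upload(local_dir, remote_dir)
        except Exception as exc:  # collect the failure so the caller can inspect it
            errors.append(exc)

    thread = threading.Thread(target=run)
    thread.start()
    return thread


def broken(remote_root: Path) -> list:
    """The context manager deletes the directory before the upload reads it."""
    errors: list = []
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "model.pt").write_text("weights")
        thread = report_async(tmp, str(remote_root / "ckpt_broken"), errors)
    # The temporary directory is already gone here; the upload has not started.
    thread.join()
    return errors


def fixed(remote_root: Path) -> list:
    """The directory outlives the upload and is removed only afterwards."""
    errors: list = []
    tmp = tempfile.mkdtemp()  # not tied to a `with` block
    Path(tmp, "model.pt").write_text("weights")
    thread = report_async(tmp, str(remote_root / "ckpt_fixed"), errors)
    # In a real training loop this cleanup would happen later, for example
    # at the end of training, rather than immediately after the report.
    thread.join()
    shutil.rmtree(tmp)
    return errors
```

In `broken`, the background copy fails because its source directory no longer exists; in `fixed`, the copy succeeds because deletion is deferred until the upload has finished.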

@ray-gardener (bot) added the docs (An issue or change related to documentation) and train (Ray Train Related Issue) labels Oct 1, 2025
@elliot-barn elliot-barn requested a review from a team as a code owner October 2, 2025 00:00
@TimothySeah TimothySeah force-pushed the tseah/validating-checkpoints branch from c5741d9 to ef44dd8 Compare October 2, 2025 18:04
cursor[bot]

This comment was marked as outdated.

@aslonnie aslonnie requested a review from elliot-barn October 3, 2025 04:29

@aslonnie aslonnie left a comment


@elliot-barn to review the dependency change.

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah force-pushed the tseah/validating-checkpoints branch from 9c28ce9 to 636cb68 Compare October 3, 2025 19:11
Signed-off-by: Timothy Seah <tseah@anyscale.com>
cursor[bot]

This comment was marked as outdated.


@justinvyu justinvyu left a comment


Thanks!

Questions to answer in the user guide:

  • How do I set resources on the validation group?
    • Emphasize that we can set resources differently on the validation group to leverage a heterogeneous cluster. For example, show that you can use different GPU types (A100 for training and A10G for inference, which has a lower GPU memory requirement). I think we should also show a best practice of scheduling even CPU tasks onto different nodes to prevent validation from interfering with training.
  • What's the lifecycle of "metrics" attached to the checkpoint? Partial metrics -> validation -> Full metrics
  • How can I access my validation results at the end of training?
  • Followup: How can I report my validation results to wandb?
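The "partial metrics -> validation -> full metrics" lifecycle asked about above can be sketched in plain Python. This is only an illustration of the idea under stated assumptions, not Ray Train's implementation: `ReportedCheckpoint`, `report`, and `validate` are hypothetical names, and the validation result is a placeholder value.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class ReportedCheckpoint:
    """Hypothetical record: a checkpoint path plus the metrics attached to it."""

    path: str
    metrics: Dict[str, float]
    validation: Optional[Future] = field(default=None, repr=False)

    def full_metrics(self) -> Dict[str, float]:
        """Wait for validation to finish, then merge its metrics in."""
        if self.validation is not None:
            self.metrics.update(self.validation.result())
            self.validation = None
        return self.metrics


def report(
    path: str,
    metrics: Dict[str, float],
    validate_fn: Callable[[str], Dict[str, float]],
    executor: ThreadPoolExecutor,
) -> ReportedCheckpoint:
    """Attach partial (training) metrics now and start validation in the background."""
    future = executor.submit(validate_fn, path)
    return ReportedCheckpoint(path, dict(metrics), future)


def validate(path: str) -> Dict[str, float]:
    """Placeholder validation: a real one would load the checkpoint and evaluate it."""
    return {"val_accuracy": 0.9}
```

A caller would see only the training metrics immediately after `report(...)` returns, and the merged dictionary once `full_metrics()` has waited for the background validation.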

…do to request gpu

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>

TimothySeah commented Oct 9, 2025

Thanks!

Questions to answer in the user guide:

  • How do I set resources on the validation group?

    • Emphasize that we can set resources differently on the validation group to leverage a heterogeneous cluster. For example, show that you can use different GPU types (A100 for training and A10G for inference, which has a lower GPU memory requirement). I think we should also show a best practice of scheduling even CPU tasks onto different nodes to prevent validation from interfering with training.
  • What's the lifecycle of "metrics" attached to the checkpoint? Partial metrics -> validation -> Full metrics

  • How can I access my validation results at the end of training?

  • Followup: How can I report my validation results to wandb?

Done. A few comments:

  • It looks like map_batches (https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.map_batches.html) cannot request specific GPU types. CC @ray-project/ray-data.
  • Once I modify get_all_reported_checkpoints to work with async validation, I will add it to the checkpoint metrics lifecycle section.
  • Will document experiment tracking in a followup PR since that might take a few steps
  • "I think we should also show a best practice to schedule even CPU tasks onto different nodes to prevent validation from interfering with training." Discussed offline; will do this in a followup.

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah added the go (add ONLY when ready to merge, run all tests) label Oct 13, 2025
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>

@angelinalg angelinalg left a comment


Just some style nits.
Also, in the diagram, there should be a comma after the Latin abbreviation "i.e.".

Comment on lines +28 to +31:

.. literalinclude:: ../doc_code/asynchronous_validation.py
    :language: python
    :start-after: __validate_fn_simple_start__
    :end-before: __validate_fn_simple_end__

This example isn't very practical because we don't allow the user to run this validate_fn as a GPU task, so they can only do CPU inference here. What about keeping this as a skeleton function and pointing the user to the section below on how to write a distributed validate function?


Done - @matthewdeng FYI.

TimothySeah and others added 2 commits October 15, 2025 15:53
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah

Just some style nits. Also, in the diagram, there should be a comma after the Latin abbreviation "i.e.".

Done. Ended up using ": {}" at @justinvyu's suggestion instead of "e.g.".

@matthewdeng matthewdeng merged commit 67c610b into ray-project:master Oct 16, 2025
6 checks passed
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 22, 2025
Signed-off-by: xgui <xgui@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Oct 23, 2025
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Labels

  • docs: An issue or change related to documentation
  • go: add ONLY when ready to merge, run all tests
  • train: Ray Train Related Issue

7 participants