[train][doc] Add checkpoint_upload_mode to checkpoint docs #56860

Merged
justinvyu merged 19 commits into ray-project:master from TimothySeah:tseah/doc-checkpoint-upload-mode
Oct 2, 2025

@TimothySeah TimothySeah commented Sep 24, 2025

Summary

Add checkpoint_upload_mode to checkpoint docs

Testing

docbuild


Note

Adds docs and examples for checkpoint upload modes (SYNC/ASYNC/NO_UPLOAD), updates API refs, includes s3torchconnector deps, and tweaks CI to use classic conda solver for mpi4py.

  • Docs (Train Checkpoints)
    • Add CheckpointUploadMode with guidance and examples for SYNC, ASYNC, and NO_UPLOAD in doc/source/train/user-guides/checkpoints.rst and doc_code/checkpoints.py.
    • Reference new figures and code snippets; update API refs to include ~train.CheckpointUploadMode in doc/source/train/api/api.rst.
  • Dependencies
    • Add s3torchconnector==1.4.3 (and s3torchconnectorclient==1.4.3) in python/requirements_compiled.txt; include s3torchconnector in python/requirements/ml/train-test-requirements.txt.
  • CI
    • Use classic conda solver for mpi4py install in ci/env/install-miniforge.sh.
    • Include ci/env/install-miniforge.sh in ci/docker/base.ml.wanda.yaml build context.

Written by Cursor Bugbot for commit 9dcf683.

@TimothySeah TimothySeah added the "go" label (add ONLY when ready to merge, run all tests) Sep 24, 2025
@TimothySeah TimothySeah marked this pull request as ready for review September 24, 2025 18:34
@TimothySeah TimothySeah requested review from a team as code owners September 24, 2025 18:34
@ray-gardener ray-gardener bot added the "docs" (An issue or change related to documentation) and "train" (Ray Train Related Issue) labels Sep 24, 2025
@justinvyu justinvyu left a comment

can you add an explicit "Asynchronous checkpointing" header?

@TimothySeah TimothySeah requested a review from a team as a code owner September 26, 2025 18:58
@aslonnie aslonnie requested review from elliot-barn and removed request for a team September 26, 2025 22:49
the next training step can start in parallel. If so, you should use
``ray.train.CheckpointUploadMode.ASYNC``. This is helpful for larger
checkpoints that might take longer to upload, but might add unnecessary
complexity if you want to immediately upload a small checkpoint.
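The size-based guidance in the quoted passage can be sketched as a small helper. This is only an illustration: `UploadMode` is a stand-in enum that mirrors the mode names from this PR rather than the real `ray.train.CheckpointUploadMode`, and both `choose_upload_mode` and the 1 GiB cutoff are hypothetical. In real code the chosen value would presumably be passed through the `checkpoint_upload_mode` argument named in the PR title.

```python
from enum import Enum


class UploadMode(Enum):
    """Stand-in mirroring the mode names in this PR; not the Ray class."""

    SYNC = "SYNC"
    ASYNC = "ASYNC"
    NO_UPLOAD = "NO_UPLOAD"


# Hypothetical cutoff: the docs only say "larger" vs. "small" checkpoints.
ASYNC_THRESHOLD_BYTES = 1024**3  # 1 GiB, chosen purely for illustration


def choose_upload_mode(checkpoint_bytes: int) -> UploadMode:
    """Pick ASYNC for large checkpoints and SYNC for small ones."""
    if checkpoint_bytes >= ASYNC_THRESHOLD_BYTES:
        # Large upload: worth overlapping with the next training step.
        return UploadMode.ASYNC
    # Small upload: the extra thread adds complexity for little gain.
    return UploadMode.SYNC
```

A suitable threshold depends on upload bandwidth and step time, so it should be measured rather than copied from this sketch.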

It might be helpful to elaborate on the tradeoffs between async and sync: what the complexities might be, whether there is a significant performance or memory tradeoff, or whether async is usually just the better option.

Contributor Author

Added "which kicks off a new thread to upload the checkpoint" and moved "Each ``report`` blocks until the previous ``report``'s checkpoint upload completes before starting a new checkpoint upload thread. Ray Train does this to avoid accumulating too many upload threads and potentially running out of memory." to the main body.

What I'm going for is:

  • complexities: kicking off a thread + the report blocking thing above
  • performance: only worthwhile if checkpoint is big
  • memory: more memory from new thread, but this is capped at 1 thread as per the comment above

Lmk if the new wording addresses these points.
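
For readers of this thread, the blocking behavior described above can be modeled with the standard library alone. This is a toy sketch, not Ray Train's implementation: `AsyncUploader`, its `report` method, and `run_training` are hypothetical names, and `time.sleep` stands in for both the training step and the upload. It shows a `report` call joining the previous upload thread before starting a new one, so at most one upload thread exists at a time.

```python
import threading
import time
from typing import List, Optional


class AsyncUploader:
    """Toy model: at most one background upload thread at a time."""

    def __init__(self, upload_seconds: float) -> None:
        self.upload_seconds = upload_seconds
        self.completed: List[str] = []
        self._thread: Optional[threading.Thread] = None

    def _upload(self, name: str) -> None:
        time.sleep(self.upload_seconds)  # stand-in for a slow upload
        self.completed.append(name)

    def report(self, name: str) -> None:
        # Block until the previous upload finishes, capping memory use
        # at a single in-flight upload thread.
        if self._thread is not None:
            self._thread.join()
        self._thread = threading.Thread(target=self._upload, args=(name,))
        self._thread.start()

    def wait(self) -> None:
        # Flush the last in-flight upload before training exits.
        if self._thread is not None:
            self._thread.join()


def run_training(num_steps: int, step_seconds: float,
                 uploader: AsyncUploader) -> float:
    """Run a fake training loop and return the elapsed wall time."""
    start = time.perf_counter()
    for step in range(num_steps):
        time.sleep(step_seconds)  # stand-in for one training step
        uploader.report(f"checkpoint_{step}")
    uploader.wait()
    return time.perf_counter() - start
```

In this model the upload of step N overlaps with the training work of step N+1, which is why the approach only pays off when the upload is slow relative to a step.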

@justinvyu justinvyu left a comment

Just one comment about the figure

Signed-off-by: Timothy Seah <tseah@anyscale.com>
TimothySeah and others added 10 commits October 2, 2025 11:01
This reverts commit f073cef.
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
@TimothySeah TimothySeah force-pushed the tseah/doc-checkpoint-upload-mode branch from 9dcf683 to b5b05fc Compare October 2, 2025 18:01
@justinvyu justinvyu merged commit 77b1dcf into ray-project:master Oct 2, 2025
6 checks passed
dstrodtman pushed a commit to dstrodtman/ray that referenced this pull request Oct 6, 2025
…ct#56860)

Add checkpoint_upload_mode to checkpoint docs

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>