[train][doc] Add checkpoint_upload_mode to checkpoint docs#56860
justinvyu merged 19 commits into ray-project:master
justinvyu left a comment
Can you add an explicit "Asynchronous checkpointing" header?
the next training step can start in parallel. If so, you should use
``ray.train.CheckpointUploadMode.ASYNC``. This is helpful for larger
checkpoints that might take longer to upload, but might add unnecessary
complexity if you want to immediately upload a small checkpoint.
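For illustration, the quoted paragraph could correspond to a training function along these lines. This is only a sketch: the `checkpoint_upload_mode` keyword is taken from the PR title and `ray.train.CheckpointUploadMode.ASYNC` from the quoted text, while `train_fn`, the `num_epochs` config key, and the temporary-directory handling are hypothetical names chosen for the example. It assumes a Ray version whose `ray.train.report` accepts this keyword.

```python
def train_fn(config):
    # Imported lazily so the function can be defined without Ray installed.
    import tempfile

    import ray.train

    for epoch in range(config.get("num_epochs", 2)):
        # ... one epoch of training would go here ...

        # Hypothetical: write the model state into a fresh local directory.
        checkpoint_dir = tempfile.mkdtemp()

        # Assumption: report() accepts checkpoint_upload_mode (per the PR title).
        # With ASYNC, the upload is meant to overlap with the next training step.
        ray.train.report(
            metrics={"epoch": epoch},
            checkpoint=ray.train.Checkpoint.from_directory(checkpoint_dir),
            checkpoint_upload_mode=ray.train.CheckpointUploadMode.ASYNC,
        )
```

For a small checkpoint, leaving the mode at its default would avoid the extra moving parts that the quoted text warns about.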
It might be helpful to elaborate on the tradeoffs between async and sync: what the complexities might be, whether there is a significant performance or memory tradeoff, or whether async is usually just the better option.
Added "which kicks off a new thread to upload the checkpoint" and moved "Each ``report`` blocks until the previous ``report``'s checkpoint upload completes before starting a new checkpoint upload thread. Ray Train does this to avoid accumulating too many upload threads and potentially running out of memory." to the main body.
What I'm going for is:
- complexities: kicking off a thread + the report blocking thing above
- performance: only worthwhile if checkpoint is big
- memory: more memory from new thread, but this is capped at 1 thread as per the comment above
Lmk if the new wording addresses these points.
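To make the "capped at 1 thread" point concrete, here is a toy model of the behavior described above, using only the standard library. It is not Ray Train's implementation; `AsyncUploader`, `slow_upload`, and the checkpoint names are made up for this sketch.

```python
import threading
import time


class AsyncUploader:
    """Toy stand-in for the described behavior; not Ray Train's code."""

    def __init__(self, upload_fn):
        self._upload_fn = upload_fn
        self._inflight = None  # the single upload thread, if any
        self._lock = threading.Lock()
        self._active = 0
        self.max_concurrent = 0  # highest number of uploads seen running at once

    def _run(self, checkpoint):
        with self._lock:
            self._active += 1
            self.max_concurrent = max(self.max_concurrent, self._active)
        try:
            self._upload_fn(checkpoint)
        finally:
            with self._lock:
                self._active -= 1

    def report(self, checkpoint):
        # Block until the previous upload completes, then start a new thread.
        if self._inflight is not None:
            self._inflight.join()
        self._inflight = threading.Thread(target=self._run, args=(checkpoint,))
        self._inflight.start()

    def wait(self):
        # Wait for the last upload before exiting.
        if self._inflight is not None:
            self._inflight.join()


uploaded = []


def slow_upload(checkpoint):
    time.sleep(0.05)  # pretend the upload takes a while
    uploaded.append(checkpoint)


uploader = AsyncUploader(slow_upload)
for step in range(3):
    uploader.report(f"checkpoint_{step}")  # training would continue after this returns
uploader.wait()
```

In this model, extra memory is bounded by one in-flight upload, and the cost is that a `report` call can stall when the previous upload is slower than a training step.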
justinvyu left a comment
Just one comment about the figure
Signed-off-by: Timothy Seah <tseah@anyscale.com>
This reverts commit f073cef. Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Timothy Seah <tseah@anyscale.com>
9dcf683 to b5b05fc
…ct#56860) Add checkpoint_upload_mode to checkpoint docs --------- Signed-off-by: Timothy Seah <tseah@anyscale.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com> Co-authored-by: elliot-barn <elliot.barnwell@anyscale.com> Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
…ct#56860) Add checkpoint_upload_mode to checkpoint docs --------- Signed-off-by: Timothy Seah <tseah@anyscale.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com> Co-authored-by: elliot-barn <elliot.barnwell@anyscale.com> Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…ct#56860) Add checkpoint_upload_mode to checkpoint docs --------- Signed-off-by: Timothy Seah <tseah@anyscale.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com> Co-authored-by: elliot-barn <elliot.barnwell@anyscale.com> Co-authored-by: Justin Yu <justinvyu@anyscale.com>
…ct#56860) Add checkpoint_upload_mode to checkpoint docs --------- Signed-off-by: Timothy Seah <tseah@anyscale.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com> Co-authored-by: elliot-barn <elliot.barnwell@anyscale.com> Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Josh Kodi <joshkodi@gmail.com>
…ct#56860) Add checkpoint_upload_mode to checkpoint docs --------- Signed-off-by: Timothy Seah <tseah@anyscale.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com> Co-authored-by: elliot-barn <elliot.barnwell@anyscale.com> Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Aydin Abiar <aydin@anyscale.com>
…ct#56860) Add checkpoint_upload_mode to checkpoint docs --------- Signed-off-by: Timothy Seah <tseah@anyscale.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com> Co-authored-by: elliot-barn <elliot.barnwell@anyscale.com> Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Future-Outlier <eric901201@gmail.com>
Summary
Add checkpoint_upload_mode to checkpoint docs
Testing
docbuild
Note
Adds docs and examples for checkpoint upload modes (SYNC/ASYNC/NO_UPLOAD), updates API refs, includes s3torchconnector deps, and tweaks CI to use classic conda solver for mpi4py.
- `CheckpointUploadMode` with guidance and examples for `SYNC`, `ASYNC`, and `NO_UPLOAD` in `doc/source/train/user-guides/checkpoints.rst` and `doc_code/checkpoints.py`.
- `~train.CheckpointUploadMode` in `doc/source/train/api/api.rst`.
- `s3torchconnector==1.4.3` (and `s3torchconnectorclient==1.4.3`) in `python/requirements_compiled.txt`; include `s3torchconnector` in `python/requirements/ml/train-test-requirements.txt`.
- `mpi4py` install in `ci/env/install-miniforge.sh`.
- `ci/env/install-miniforge.sh` in `ci/docker/base.ml.wanda.yaml` build context.
Written by Cursor Bugbot for commit 9dcf683.