[Data][Docs](draft): Document job-level checkpointing #60289
anonihunter wants to merge 1 commit into ray-project:master
Code Review
This pull request adds documentation for the job-level checkpointing feature in Ray Data. The changes include adding a new section to the batch inference guide and a more detailed section with a code example to the execution configurations guide. The documentation is clear and well-written. I have one suggestion to improve the code example to better reflect best practices for checkpoint storage in a distributed environment.
| checkpoint_path="/tmp/ray_data_checkpoint", | ||
| delete_checkpoint_on_success=False, |
The example for checkpoint_path uses /tmp/ray_data_checkpoint, which can be misleading. Checkpoints must be stored in a location accessible by all nodes in the cluster, such as a cloud storage path (e.g., S3) or a network file system (NFS). Using /tmp will only work on a single-node cluster and is not a good practice for production workloads. I suggest updating the example to use a more realistic path and adding comments to clarify the requirements for checkpoint_path and the purpose of delete_checkpoint_on_success.
Suggested change:
| # The checkpoint path must be accessible from all nodes in the cluster |
| # (e.g., a cloud storage path). |
| checkpoint_path="s3://my-bucket/ray_data_checkpoints", |
| # Keep checkpoint files after job success for inspection. |
| delete_checkpoint_on_success=False, |
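The storage requirement raised in this comment can be approximated with a quick pre-flight check before a job starts. The following is a minimal sketch, not part of Ray Data: `looks_cluster_visible` and `REMOTE_SCHEMES` are hypothetical names, and the scheme list is an assumption about what counts as shared storage. The heuristic can't recognize a network file system mounted at a local-looking path, so treat a `False` result as a prompt to double-check rather than a hard error.

```python
from urllib.parse import urlparse

# Hypothetical allow-list of URI schemes that usually denote storage
# reachable from every node (an assumption, not a Ray Data constant).
REMOTE_SCHEMES = {"s3", "gs", "gcs", "abfs", "abfss", "hdfs"}


def looks_cluster_visible(checkpoint_path: str) -> bool:
    """Return True if the path uses a scheme that typically means shared storage."""
    scheme = urlparse(checkpoint_path).scheme.lower()
    return scheme in REMOTE_SCHEMES


print(looks_cluster_visible("s3://my-bucket/ray_data_checkpoints"))  # True
print(looks_cluster_visible("/tmp/ray_data_checkpoint"))  # False
```

A plain path such as `/tmp/ray_data_checkpoint` has no URI scheme, so the check flags it. That matches the concern here: the path works only when every worker shares the same filesystem.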
…ointing Signed-off-by: Abhishek Kumar <anonyomoushunter@gmail.com>
| Job-level checkpointing can be used to make offline batch inference jobs | ||
| resilient to failures such as node restarts or transient execution errors. |
Active voice
Suggested change:
| Use job-level checkpointing to make offline batch inference jobs resilient to failures |
| like node restarts or transient execution errors. |
| When enabled, Ray Data records progress during execution. If a batch inference | ||
| job fails partway through processing, rerunning the same pipeline with the same | ||
| checkpoint configuration will resume from the last completed checkpoint instead | ||
| of reprocessing all records. |
Suggested change:
| When enabled, Ray Data records progress during execution. If a batch inference |
| job fails partway through processing, rerunning the same pipeline with the same |
| checkpoint configuration resumes from the last completed checkpoint instead |
| of reprocessing all records. |
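The resume behavior described in this passage can be illustrated with a toy model. This is a sketch of the general idea only, assuming each row carries a unique `id` and that progress is stored as a JSON list of completed IDs. It is not how Ray Data implements checkpointing, and `run_with_checkpoint` and its `fail_after` parameter are hypothetical names used for the demonstration.

```python
import json
import os
import tempfile


def run_with_checkpoint(rows, process, checkpoint_file, fail_after=None):
    """Process rows, skipping any whose ID is already recorded in checkpoint_file."""
    done = set()
    if os.path.exists(checkpoint_file):
        with open(checkpoint_file) as f:
            done = set(json.load(f))

    processed_now = []
    for row in rows:
        if row["id"] in done:
            continue  # Completed in an earlier run; don't reprocess.
        if fail_after is not None and len(processed_now) >= fail_after:
            raise RuntimeError("simulated failure")
        process(row)
        processed_now.append(row["id"])
        done.add(row["id"])
        # Persist progress after every row so a crash loses no completed work.
        with open(checkpoint_file, "w") as f:
            json.dump(sorted(done), f)
    return processed_now


rows = [{"id": i} for i in range(5)]
with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "done.json")
    try:
        # First attempt fails after two rows have been processed.
        run_with_checkpoint(rows, lambda r: None, ckpt, fail_after=2)
    except RuntimeError:
        pass
    # Rerunning with the same checkpoint file picks up where the first run stopped.
    resumed = run_with_checkpoint(rows, lambda r: None, ckpt)
    print(resumed)  # [2, 3, 4]
```

The second call uses the same input and the same checkpoint location, mirroring the requirement to rerun the same pipeline with the same checkpoint configuration.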
| Job-level checkpointing is configured through the | ||
| :class:`~ray.data.checkpoint.CheckpointConfig` and is set on the current | ||
| :class:`~ray.data.DataContext`. |
Active voice
Suggested change:
| To configure job-level checkpointing, specify a |
| :class:`~ray.data.checkpoint.CheckpointConfig` on the current |
| :class:`~ray.data.DataContext`. |
| .. The checkpoint path must be accessible by all nodes in the Ray cluster. | ||
| .. delete_checkpoint_on_success=False preserves checkpoints after successful runs. |
Is this intended to be visible to the reader?
| Ray Data supports job-level checkpointing to improve fault tolerance for | ||
| long-running batch pipelines. When enabled, Ray Data can resume a failed job | ||
| from the last successfully processed records instead of restarting from the | ||
| beginning. |
Here and elsewhere -- for consistency, could we use the term "rows" instead of "records"? I think we typically use "row" in the Ray Data documentation
Suggested change:
| Ray Data supports job-level checkpointing to improve fault tolerance for |
| long-running batch pipelines. When enabled, Ray Data can resume a failed job |
| from the last successfully processed rows instead of restarting from the |
| beginning. |
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public Slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
Hey @anonihunter, this PR has been stale for a while, so I've asked @yuhuan130 to take this over the finish line. When I merge the PR, I'll include you as a co-author so you get credit.
Thanks for the update, really appreciate you taking this forward 🙌
This PR documents Ray Data job-level checkpointing functionality added in #59409. Adds documentation explaining job-level checkpointing and its application to offline batch inference, including configuration examples.
**Sections modified:**
- Execution Configurations
- End-to-end: Offline Batch Inference
Supersedes #60289
Fixes #60250
---------
Signed-off-by: Abhishek Kumar <anonyomoushunter@gmail.com>
Signed-off-by: “Alex <alexchien130@gmail.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: Abhishek Kumar <anonyomoushunter@gmail.com>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
This draft PR documents Ray Data job-level checkpointing that was added in #59409.
The documentation adds a short explanation of job-level checkpointing and how it
applies to offline batch inference, with a small configuration example.
Sections touched:
Feedback welcome.
Fixes #60250