[Data][Docs](draft): Document job-level checkpointing #60289

Closed
anonihunter wants to merge 1 commit into ray-project:master from anonihunter:docs-job-level-checkpointing

@anonihunter anonihunter commented Jan 19, 2026

This draft PR documents Ray Data job-level checkpointing that was added in #59409.

The documentation adds a short explanation of job-level checkpointing and how it
applies to offline batch inference, with a small configuration example.

Sections touched:

  • Execution Configurations
  • End-to-end: Offline Batch Inference

Feedback welcome.

Fixes #60250

@anonihunter anonihunter requested a review from a team as a code owner January 19, 2026 08:59

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds documentation for the job-level checkpointing feature in Ray Data. The changes include adding a new section to the batch inference guide and a more detailed section with a code example to the execution configurations guide. The documentation is clear and well-written. I have one suggestion to improve the code example to better reflect best practices for checkpoint storage in a distributed environment.

Comment on lines +93 to +94
checkpoint_path="/tmp/ray_data_checkpoint",
delete_checkpoint_on_success=False,

medium

The example for checkpoint_path uses /tmp/ray_data_checkpoint, which can be misleading. Checkpoints must be stored in a location accessible by all nodes in the cluster, such as a cloud storage path (e.g., S3) or a network file system (NFS). Using /tmp will only work on a single-node cluster and is not a good practice for production workloads. I suggest updating the example to use a more realistic path and adding comments to clarify the requirements for checkpoint_path and the purpose of delete_checkpoint_on_success.

Suggested change
- checkpoint_path="/tmp/ray_data_checkpoint",
- delete_checkpoint_on_success=False,
+ # The checkpoint path must be accessible from all nodes in the cluster
+ # (e.g., a cloud storage path).
+ checkpoint_path="s3://my-bucket/ray_data_checkpoints",
+ # Keep checkpoint files after job success for inspection.
+ delete_checkpoint_on_success=False,
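
To illustrate the concern in this comment, here is a minimal sketch of a pre-flight check that flags checkpoint paths likely to exist on only one machine. The helper `looks_node_local` and the `NODE_LOCAL_PREFIXES` list are hypothetical names made up for this example; they are not part of Ray Data, and the check is only a heuristic (a shared NFS mount looks like any other local path).

```python
from urllib.parse import urlparse

# Assumption for this sketch: directories treated as per-node scratch space.
NODE_LOCAL_PREFIXES = ("/tmp", "/var/tmp")


def looks_node_local(checkpoint_path: str) -> bool:
    """Heuristically flag paths that probably live on a single node."""
    parsed = urlparse(checkpoint_path)
    if parsed.scheme not in ("", "file"):
        # A URI scheme such as s3:// or gs:// points at remote storage.
        return False
    local_path = parsed.path
    return any(
        local_path == prefix or local_path.startswith(prefix + "/")
        for prefix in NODE_LOCAL_PREFIXES
    )


for path in ("/tmp/ray_data_checkpoint", "s3://my-bucket/ray_data_checkpoints"):
    status = "node-local, avoid on multi-node clusters" if looks_node_local(path) else "ok"
    print(f"{path}: {status}")
```

A check like this could run before the job starts, so a misconfigured path is caught early rather than after a failure.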

@anonihunter anonihunter force-pushed the docs-job-level-checkpointing branch from 07dcbb5 to 26d3139 Compare January 19, 2026 09:09
…ointing

Signed-off-by: Abhishek Kumar <anonyomoushunter@gmail.com>
@anonihunter anonihunter force-pushed the docs-job-level-checkpointing branch from 26d3139 to 0e4b599 Compare January 19, 2026 09:58
@ray-gardener ray-gardener bot added the docs, data, and community-contribution labels Jan 19, 2026

@bveeramani bveeramani left a comment

ty for the contribution!

Comment on lines +242 to +243
Job-level checkpointing can be used to make offline batch inference jobs
resilient to failures such as node restarts or transient execution errors.

Active voice

Suggested change
- Job-level checkpointing can be used to make offline batch inference jobs
- resilient to failures such as node restarts or transient execution errors.
+ Use job-level checkpointing to make offline batch inference jobs resilient to failures
+ like node restarts or transient execution errors.

Comment on lines +245 to +248
When enabled, Ray Data records progress during execution. If a batch inference
job fails partway through processing, rerunning the same pipeline with the same
checkpoint configuration will resume from the last completed checkpoint instead
of reprocessing all records.

Suggested change
- When enabled, Ray Data records progress during execution. If a batch inference
- job fails partway through processing, rerunning the same pipeline with the same
- checkpoint configuration will resume from the last completed checkpoint instead
- of reprocessing all records.
+ When enabled, Ray Data records progress during execution. If a batch inference
+ job fails partway through processing, rerunning the same pipeline with the same
+ checkpoint configuration resumes from the last completed checkpoint instead
+ of reprocessing all records.
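
The resume behavior described in this paragraph can be sketched with a toy model in plain Python. This is not Ray Data's implementation: `run_job`, the `fail_at` parameter, and the JSON progress file are all hypothetical stand-ins chosen only to show the idea of skipping rows that an earlier run already finished.

```python
import json
import tempfile
from pathlib import Path


def run_job(rows, checkpoint_file: Path, fail_at=None):
    """Process rows, skipping any whose id is already recorded as done."""
    if checkpoint_file.exists():
        done = set(json.loads(checkpoint_file.read_text()))
    else:
        done = set()
    processed = []
    for row in rows:
        if row["id"] in done:
            continue  # Finished in an earlier run; do not reprocess.
        if fail_at is not None and row["id"] == fail_at:
            raise RuntimeError(f"simulated failure at row {row['id']}")
        processed.append(row["id"])  # Stand-in for running inference on the row.
        done.add(row["id"])
        # Persist progress after every row so a rerun can pick up from here.
        checkpoint_file.write_text(json.dumps(sorted(done)))
    return processed


rows = [{"id": i} for i in range(6)]
with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "progress.json"
    try:
        run_job(rows, ckpt, fail_at=3)  # First attempt fails partway through.
    except RuntimeError:
        pass
    resumed = run_job(rows, ckpt)  # Same rows, same checkpoint file.

print(resumed)
```

The second call only handles the rows the first attempt never completed, which is the property the documentation text promises for a rerun with the same checkpoint configuration.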

Comment on lines +79 to +81
Job-level checkpointing is configured through the
:class:`~ray.data.checkpoint.CheckpointConfig` and is set on the current
:class:`~ray.data.DataContext`.

Active voice

Suggested change
- Job-level checkpointing is configured through the
- :class:`~ray.data.checkpoint.CheckpointConfig` and is set on the current
- :class:`~ray.data.DataContext`.
+ To configure job-level checkpointing, specify a
+ :class:`~ray.data.checkpoint.CheckpointConfig` on the current
+ :class:`~ray.data.DataContext`.

Comment on lines +97 to +98
.. The checkpoint path must be accessible by all nodes in the Ray cluster.
.. delete_checkpoint_on_success=False preserves checkpoints after successful runs.

Is this intended to be visible to the reader?

Comment on lines +74 to +77
Ray Data supports job-level checkpointing to improve fault tolerance for
long-running batch pipelines. When enabled, Ray Data can resume a failed job
from the last successfully processed records instead of restarting from the
beginning.

Here and elsewhere -- for consistency, could we use the term "rows" instead of "records"? I think we typically use "row" in the Ray Data documentation

Suggested change
- Ray Data supports job-level checkpointing to improve fault tolerance for
- long-running batch pipelines. When enabled, Ray Data can resume a failed job
- from the last successfully processed records instead of restarting from the
- beginning.
+ Ray Data supports job-level checkpointing to improve fault tolerance for
+ long-running batch pipelines. When enabled, Ray Data can resume a failed job
+ from the last successfully processed rows instead of restarting from the
+ beginning.

@bveeramani bveeramani changed the title from "docs(draft): document job-level checkpointing" to "[Data][Docs](draft): Document job-level checkpointing" Jan 27, 2026
@github-actions

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale label Feb 10, 2026
@bveeramani

Hey @anonihunter, this PR has been stale for a while, so I've asked @yuhuan130 to take this over the finish line.

When I merge the PR, I'll include you as a co-author so you get credit

@bveeramani bveeramani closed this Feb 10, 2026
@anonihunter

Thanks for the update, really appreciate you taking this forward 🙌
I wasn’t at my working desk for a bit, so I couldn’t quickly apply the requested tweaks.
Happy to be included as a co-author — thanks for that!

bveeramani added a commit that referenced this pull request Feb 11, 2026
This PR documents Ray Data job-level checkpointing functionality added
in #59409.

Adds documentation explaining job-level checkpointing and its
application to offline batch inference, including configuration
examples.

**Sections modified:**
- Execution Configurations
- End-to-end: Offline Batch Inference

Supersedes #60289

Fixes #60250

---------

Signed-off-by: Abhishek Kumar <anonyomoushunter@gmail.com>
Signed-off-by: “Alex <alexchien130@gmail.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: Abhishek Kumar <anonyomoushunter@gmail.com>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
preneond pushed a commit to preneond/ray that referenced this pull request Feb 15, 2026
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Feb 17, 2026
preneond pushed a commit to preneond/ray that referenced this pull request Feb 17, 2026
MuhammadSaif700 pushed a commit to MuhammadSaif700/ray that referenced this pull request Feb 17, 2026
Kunchd pushed a commit to Kunchd/ray that referenced this pull request Feb 17, 2026
ans9868 pushed a commit to ans9868/ray that referenced this pull request Feb 18, 2026
Aydin-ab pushed a commit to kunling-anyscale/ray that referenced this pull request Feb 20, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
preneond pushed a commit to preneond/ray that referenced this pull request Mar 23, 2026
Labels

community-contribution, data, docs, stale

Development

Successfully merging this pull request may close these issues.

[Data] Add documentation for Ray Data checkpointing

3 participants