[train] Add get_all_reported_checkpoints ConsistencyMode by TimothySeah · Pull Request #58271 · ray-project/ray

TimothySeah · 2025-10-29T03:02:45Z

Summary

get_all_reported_checkpoints can have 2 different levels:

1) Return when all reported checkpoints have been assembled. This was
the intended behavior before this PR, though there was a bug with
get_all_reported_checkpoints + async checkpointing in which we might not
wait for the most recently reported checkpoint to be assembled, which
this PR also fixes. This could be useful if users want to end training
after they have their desired checkpoint.
2) Return when all reported checkpoints have been validated. This is
useful for the original purpose of get_all_reported_checkpoints, which
was to wait until every single checkpoint has been reported/validated
before saving them to experiment tracking from the workers themselves
(not the driver).

This PR toggles between these semantics with the new CheckpointConsistencyMode enum.
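
To make the two levels concrete, here is a minimal, self-contained sketch of the blocking behavior. It does not use Ray: `ToyCheckpointTracker`, its methods, and the `timeout` parameter are hypothetical names invented for illustration, and the enum below is a local stand-in that only borrows the `COMMITTED` and `VALIDATED` member names that come up later in this thread. It should be read as a toy model under those assumptions, not as the implementation in checkpoint_manager.py.

```python
import enum
import threading


class CheckpointConsistencyMode(enum.Enum):
    # Local stand-in enum (assumption); member names borrowed from the PR thread.
    COMMITTED = "committed"  # wait until every reported checkpoint is assembled
    VALIDATED = "validated"  # additionally wait until every validation finished


class ToyCheckpointTracker:
    """Hypothetical stand-in for a checkpoint manager, not Ray's class."""

    def __init__(self):
        self._cond = threading.Condition()
        self._num_reported = 0   # checkpoints reported by the worker
        self._committed = []     # checkpoints that are fully assembled
        self._num_validated = 0  # checkpoints whose validation finished

    def report(self):
        with self._cond:
            self._num_reported += 1
            self._cond.notify_all()

    def commit(self, name):
        with self._cond:
            self._committed.append(name)
            self._cond.notify_all()

    def validate(self):
        with self._cond:
            self._num_validated += 1
            self._cond.notify_all()

    def get_all_reported_checkpoints(self, mode, timeout=None):
        def ready():
            # Both modes require every reported checkpoint to be assembled.
            if len(self._committed) < self._num_reported:
                return False
            # VALIDATED additionally requires every validation to be done.
            if mode is CheckpointConsistencyMode.VALIDATED:
                return self._num_validated >= self._num_reported
            return True

        with self._cond:
            if not self._cond.wait_for(ready, timeout=timeout):
                raise TimeoutError("reported checkpoints are not ready yet")
            return list(self._committed)


tracker = ToyCheckpointTracker()
tracker.report()
tracker.commit("ckpt_0")

# COMMITTED returns as soon as the reported checkpoint is assembled.
committed_view = tracker.get_all_reported_checkpoints(
    CheckpointConsistencyMode.COMMITTED
)

# VALIDATED blocks until the (simulated) validation finishes a bit later.
threading.Timer(0.05, tracker.validate).start()
validated_view = tracker.get_all_reported_checkpoints(
    CheckpointConsistencyMode.VALIDATED, timeout=5
)
```

In this toy model the only difference between the two modes is the extra condition on finished validations, which is why the validated level can take noticeably longer to return when validation runs asynchronously.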

Testing

Unit tests

Signed-off-by: Timothy Seah <tseah@anyscale.com>

python/ray/train/v2/_internal/execution/checkpoint/checkpoint_manager.py

gemini-code-assist

Code Review

This pull request introduces a CheckpointView enum to control the blocking behavior of get_all_reported_checkpoints, allowing for non-blocking, waiting for upload, or waiting for validation. The changes are well-implemented across the stack, from the public API down to the internal checkpoint manager. A comprehensive test case is added to validate the new semantics. This PR also includes an important bug fix where get_all_reported_checkpoints was using an incorrect index, potentially causing it to return prematurely. My feedback focuses on improving the clarity of the conditional logic for handling the new CheckpointView options.

python/ray/train/v2/_internal/execution/checkpoint/checkpoint_manager.py

justinvyu

Thanks!

I also thought about adding some utilities to query and wait for pending validations, so that we can keep get_all_reported_checkpoints() argument-less, but I'd rather go with this solution to avoid introducing another API. But did you have any ideas for APIs to inspect validations for the future?

python/ray/train/v2/_internal/execution/checkpoint/checkpoint_manager.py

python/ray/train/v2/api/train_fn_utils.py

python/ray/train/v2/tests/test_async_checkpointing_validation.py

TimothySeah · 2025-11-15T02:37:16Z

> Thanks!
>
> I also thought about adding some utilities to query and wait for pending validations, so that we can keep get_all_reported_checkpoints() argument-less, but I'd rather go with this solution to avoid introducing another API. But did you have any ideas for APIs to inspect validations for the future?

What “inspect validations” APIs should we support that aren’t covered? get_all_reported_checkpoints can either wait until all validations are done or just return the currently finished ones. Controller logs also show the pending validations. Were you thinking stuff like canceling validations or viewing progress?

justinvyu

Thanks, a few minor comments

doc/source/train/api/api.rst

python/ray/train/v2/api/train_fn_utils.py

python/ray/train/v2/api/report_config.py

justinvyu · 2025-11-17T18:55:31Z

> What “inspect validations” APIs should we support that aren’t covered? get_all_reported_checkpoints can either wait until all validations are done or just return the currently finished ones. Controller logs also show the pending validations. Were you thinking stuff like canceling validations or viewing progress?

Yeah, like getting state, seeing failed validations, waiting on specific validations.

TimothySeah · 2025-11-18T02:03:28Z

> > What “inspect validations” APIs should we support that aren’t covered? get_all_reported_checkpoints can either wait until all validations are done or just return the currently finished ones. Controller logs also show the pending validations. Were you thinking stuff like canceling validations or viewing progress?
>
> Yeah, like getting state, seeing failed validations, waiting on specific validations.

Good point - filed a bug.

justinvyu

Nice! Before merging, do you have any other ideas for the enum names? Want to make sure we're confident before releasing the public API.

I was thinking PENDING instead of LIVE, and COMMITTED instead of UPLOADED. WDYT?

TimothySeah · 2025-11-18T03:09:58Z

> Nice! Before merging, do you have any other ideas for the enum names? Want to make sure we're confident before releasing the public API.
>
> I was thinking PENDING instead of LIVE, and COMMITTED instead of UPLOADED. WDYT?

Yeah I think COMMITTED is more accurate than UPLOADED.
I think PENDING is confusing because we aren't viewing ReportedCheckpoints that are pending - for example if the worker-side counter is 5 and we call get_all_reported_checkpoints with that enum value then we may get 4 ReportedCheckpoints; we don't actually see the 5th pending checkpoint. Wdyt of NONBLOCKING or CURRENT?
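
The counter example above can be sketched in a few lines. This is only an illustration of the argument, assuming a non-blocking view that returns whatever is already assembled; `ToyReportedCheckpoint` and `nonblocking_view` are hypothetical names and are not part of Ray.

```python
from dataclasses import dataclass


@dataclass
class ToyReportedCheckpoint:
    # Hypothetical record; Ray's ReportedCheckpoint has different fields.
    index: int
    assembled: bool


def nonblocking_view(reported):
    """Return only the checkpoints that are already assembled, without waiting."""
    return [ckpt for ckpt in reported if ckpt.assembled]


# The worker-side counter is 5, but the 5th checkpoint is still being assembled.
reported = [ToyReportedCheckpoint(index=i, assembled=(i < 4)) for i in range(5)]
view = nonblocking_view(reported)
```

Here `view` holds 4 entries and the in-flight 5th checkpoint is not visible at all, which is the reason a name like PENDING would be misleading for that enum value.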

justinvyu · 2025-11-18T07:32:44Z

Do we actually need to introduce a LIVE mode? Wondering if we can just do COMMITTED and VALIDATED. COMMITTED was the previous behavior, and LIVE is a new mode that we didn't support before.

justinvyu

Great!

[train] Add get_all_reported_checkpoints CheckpointViews

315afa0

Signed-off-by: Timothy Seah <tseah@anyscale.com>

TimothySeah requested a review from a team as a code owner October 29, 2025 03:02

cursor bot reviewed Oct 29, 2025

python/ray/train/v2/_internal/execution/checkpoint/checkpoint_manager.py (outdated, resolved)

gemini-code-assist bot reviewed Oct 29, 2025

python/ray/train/v2/_internal/execution/checkpoint/checkpoint_manager.py (outdated, resolved)

ray-gardener bot added the "train" label (Ray Train Related Issue) Oct 29, 2025

add checkpointview to __init__ and docs

1dd92c7

Signed-off-by: Timothy Seah <tseah@anyscale.com>

TimothySeah requested a review from a team as a code owner November 11, 2025 02:16

justinvyu reviewed Nov 13, 2025

rename checkpointview and view

e294d69

Signed-off-by: Timothy Seah <tseah@anyscale.com>

TimothySeah changed the title ~~[train] Add get_all_reported_checkpoints CheckpointViews~~ [train] Add get_all_reported_checkpoints ConsistencyMode Nov 15, 2025

rename unit test

6243cf2

Signed-off-by: Timothy Seah <tseah@anyscale.com>

justinvyu reviewed Nov 17, 2025

doc/source/train/api/api.rst (outdated, resolved)

python/ray/train/v2/api/train_fn_utils.py (resolved)

python/ray/train/v2/api/report_config.py (outdated, resolved)

Merge remote-tracking branch 'upstream/master' into tseah/get-all-rep…

fe7db83

…orted-checkpoints-consistency

TimothySeah added 2 commits November 17, 2025 18:15

rename to checkpointconsistencymode

e097000

Signed-off-by: Timothy Seah <tseah@anyscale.com>

improve documentation

b3739d3

Signed-off-by: Timothy Seah <tseah@anyscale.com>

TimothySeah added the "go" label (add ONLY when ready to merge, run all tests) Nov 18, 2025

TimothySeah requested a review from justinvyu November 18, 2025 02:27

justinvyu approved these changes Nov 18, 2025

UPLOADED -> COMMITTED and remove LIVE

fa3870c

Signed-off-by: Timothy Seah <tseah@anyscale.com>

justinvyu approved these changes Nov 19, 2025

justinvyu merged commit cbd8a4a into ray-project:master Nov 19, 2025
6 checks passed
