Skip to content

[train] Add get_all_reported_checkpoints ConsistencyMode#58271

Merged
justinvyu merged 8 commits intoray-project:masterfrom
TimothySeah:tseah/get-all-reported-checkpoints-consistency
Nov 19, 2025
Merged

[train] Add get_all_reported_checkpoints ConsistencyMode#58271
justinvyu merged 8 commits intoray-project:masterfrom
TimothySeah:tseah/get-all-reported-checkpoints-consistency

Conversation

@TimothySeah
Copy link
Copy Markdown
Contributor

@TimothySeah TimothySeah commented Oct 29, 2025

Summary

get_all_reported_checkponts can have 2 different levels:

  1. Return when all reported checkpoints have been assembled. Those were
    the intended semantics before this PR, though there was a bug with
    get_all_reported_checkpoints + async checkpointing in which we might not
    wait for the most recently reported checkpoint to be assembled, which
    this PR also fixes. This could be useful if users want to end training
    after they have their desired checkpoint.
  2. Return when all reported checkpoints have been validated. This is
    useful for the original purpose of get_all_reported_checkpoints, which
    was to wait until every single checkpoint has been reported/validated
    before saving them to experiment tracking from the workers themselves
    (not the driver).

This PR toggles between these semantics with the new CheckpointConsistencyMode enum.

Testing

Unit tests

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah requested a review from a team as a code owner October 29, 2025 03:02
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a CheckpointView enum to control the blocking behavior of get_all_reported_checkpoints, allowing for non-blocking, waiting for upload, or waiting for validation. The changes are well-implemented across the stack, from the public API down to the internal checkpoint manager. A comprehensive test case is added to validate the new semantics. This PR also includes an important bug fix where get_all_reported_checkpoints was using an incorrect index, potentially causing it to return prematurely. My feedback focuses on improving the clarity of the conditional logic for handling the new CheckpointView options.

@ray-gardener ray-gardener bot added the train Ray Train Related Issue label Oct 29, 2025
Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah requested a review from a team as a code owner November 11, 2025 02:16
Copy link
Copy Markdown
Contributor

@justinvyu justinvyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

I also thought about adding some utilities to query and wait for pending validations, so that we can keep get_all_reported_checkpoints() argument-less, but I'd rather go with this solution to avoid introducing another API. But did you have any ideas for APIs to inspect validations for the future?

@TimothySeah
Copy link
Copy Markdown
Contributor Author

Thanks!

I also thought about adding some utilities to query and wait for pending validations, so that we can keep get_all_reported_checkpoints() argument-less, but I'd rather go with this solution to avoid introducing another API. But did you have any ideas for APIs to inspect validations for the future?

What “inspect validations” PR’s should we support that aren’t covered? Get_all_reported_checkpoints can either wait until all validations are done or just return the currently finished ones. Controller logs also show the pending validations. Were you thinking stuff like canceling validations or viewing progress?

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah changed the title [train] Add get_all_reported_checkpoints CheckpointViews [train] Add get_all_reported_checkpoints ConsistencyMode Nov 15, 2025
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Copy link
Copy Markdown
Contributor

@justinvyu justinvyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, a few minor comments

@justinvyu
Copy link
Copy Markdown
Contributor

What “inspect validations” PR’s should we support that aren’t covered? Get_all_reported_checkpoints can either wait until all validations are done or just return the currently finished ones. Controller logs also show the pending validations. Were you thinking stuff like canceling validations or viewing progress?

Yeah, like getting state, seeing failed validations, waiting on specific validations.

@TimothySeah
Copy link
Copy Markdown
Contributor Author

What “inspect validations” PR’s should we support that aren’t covered? Get_all_reported_checkpoints can either wait until all validations are done or just return the currently finished ones. Controller logs also show the pending validations. Were you thinking stuff like canceling validations or viewing progress?

Yeah, like getting state, seeing failed validations, waiting on specific validations.

Good point - filed a bug.

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah added the go add ONLY when ready to merge, run all tests label Nov 18, 2025
Copy link
Copy Markdown
Contributor

@justinvyu justinvyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Before merging, do you have any other ideas for the enun namings? Want to make sure we're confident before releasing the public API.

I was thinking PENDING instead of LIVE, and COMMITTED instead of UPLOADED. WDYT?

@TimothySeah
Copy link
Copy Markdown
Contributor Author

TimothySeah commented Nov 18, 2025

Nice! Before merging, do you have any other ideas for the enun namings? Want to make sure we're confident before releasing the public API.

I was thinking PENDING instead of LIVE, and COMMITTED instead of UPLOADED. WDYT?

Yeah I think COMMITTED is more accurate than UPLOADED.
I think PENDING is confusing because we aren't viewing ReportedCheckpoints that are pending - for example if the worker-side counter is 5 and we call get_all_reported_checkpoints with that enum value then we may get 4 ReportedCheckpoints; we don't actually see the 5th pending checkpoint. Wdyt of NONBLOCKING or CURRENT?

@justinvyu
Copy link
Copy Markdown
Contributor

Do we actually need to introduce a LIVE mode? Wondering if we can just do COMMITTED and VALIDATED. COMMITTED was the previous behavior, and LIVE is a new mode that we didn't support before.

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Copy link
Copy Markdown
Contributor

@justinvyu justinvyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!

@justinvyu justinvyu merged commit cbd8a4a into ray-project:master Nov 19, 2025
6 checks passed
400Ping pushed a commit to 400Ping/ray that referenced this pull request Nov 21, 2025
…#58271)

`get_all_reported_checkponts` can have 2 different levels:
1) Return when all reported checkpoints have been assembled. Those were
the intended semantics before this PR, though there was a bug with
get_all_reported_checkpoints + async checkpointing in which we might not
wait for the most recently reported checkpoint to be assembled, which
this PR also fixes. This could be useful if users want to end training
after they have their desired checkpoint.
2) Return when all reported checkpoints have been validated. This is
useful for the original purpose of `get_all_reported_checkpoints`, which
was to wait until every single checkpoint has been reported/validated
before saving them to experiment tracking from the workers themselves
(not the driver).

This PR toggles between these semantics with the new `CheckpointConsistencyMode` enum.

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025
…#58271)

`get_all_reported_checkponts` can have 2 different levels:
1) Return when all reported checkpoints have been assembled. Those were
the intended semantics before this PR, though there was a bug with
get_all_reported_checkpoints + async checkpointing in which we might not
wait for the most recently reported checkpoint to be assembled, which
this PR also fixes. This could be useful if users want to end training
after they have their desired checkpoint.
2) Return when all reported checkpoints have been validated. This is
useful for the original purpose of `get_all_reported_checkpoints`, which
was to wait until every single checkpoint has been reported/validated
before saving them to experiment tracking from the workers themselves
(not the driver).

This PR toggles between these semantics with the new `CheckpointConsistencyMode` enum.

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: YK <1811651+ykdojo@users.noreply.github.com>
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
…#58271)

`get_all_reported_checkponts` can have 2 different levels:
1) Return when all reported checkpoints have been assembled. Those were
the intended semantics before this PR, though there was a bug with
get_all_reported_checkpoints + async checkpointing in which we might not
wait for the most recently reported checkpoint to be assembled, which
this PR also fixes. This could be useful if users want to end training
after they have their desired checkpoint.
2) Return when all reported checkpoints have been validated. This is
useful for the original purpose of `get_all_reported_checkpoints`, which
was to wait until every single checkpoint has been reported/validated
before saving them to experiment tracking from the workers themselves
(not the driver).

This PR toggles between these semantics with the new `CheckpointConsistencyMode` enum.

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…#58271)

`get_all_reported_checkponts` can have 2 different levels:
1) Return when all reported checkpoints have been assembled. Those were
the intended semantics before this PR, though there was a bug with
get_all_reported_checkpoints + async checkpointing in which we might not
wait for the most recently reported checkpoint to be assembled, which
this PR also fixes. This could be useful if users want to end training
after they have their desired checkpoint.
2) Return when all reported checkpoints have been validated. This is
useful for the original purpose of `get_all_reported_checkpoints`, which
was to wait until every single checkpoint has been reported/validated
before saving them to experiment tracking from the workers themselves
(not the driver).

This PR toggles between these semantics with the new `CheckpointConsistencyMode` enum.

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests train Ray Train Related Issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants