[train] Add get_all_reported_checkpoints ConsistencyMode#58271
Conversation
Signed-off-by: Timothy Seah <tseah@anyscale.com>
python/ray/train/v2/_internal/execution/checkpoint/checkpoint_manager.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Code Review
This pull request introduces a CheckpointView enum to control the blocking behavior of get_all_reported_checkpoints, allowing for non-blocking, waiting for upload, or waiting for validation. The changes are well-implemented across the stack, from the public API down to the internal checkpoint manager. A comprehensive test case is added to validate the new semantics. This PR also includes an important bug fix where get_all_reported_checkpoints was using an incorrect index, potentially causing it to return prematurely. My feedback focuses on improving the clarity of the conditional logic for handling the new CheckpointView options.
python/ray/train/v2/_internal/execution/checkpoint/checkpoint_manager.py
Outdated
Show resolved
Hide resolved
Signed-off-by: Timothy Seah <tseah@anyscale.com>
justinvyu
left a comment
There was a problem hiding this comment.
Thanks!
I also thought about adding some utilities to query and wait for pending validations, so that we can keep get_all_reported_checkpoints() argument-less, but I'd rather go with this solution to avoid introducing another API. But did you have any ideas for APIs to inspect validations for the future?
python/ray/train/v2/_internal/execution/checkpoint/checkpoint_manager.py
Show resolved
Hide resolved
python/ray/train/v2/_internal/execution/checkpoint/checkpoint_manager.py
Outdated
Show resolved
Hide resolved
python/ray/train/v2/tests/test_async_checkpointing_validation.py
Outdated
Show resolved
Hide resolved
What “inspect validations” PR’s should we support that aren’t covered? Get_all_reported_checkpoints can either wait until all validations are done or just return the currently finished ones. Controller logs also show the pending validations. Were you thinking stuff like canceling validations or viewing progress? |
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
justinvyu
left a comment
There was a problem hiding this comment.
Thanks, a few minor comments
Yeah, like getting state, seeing failed validations, waiting on specific validations. |
…orted-checkpoints-consistency
Good point - filed a bug. |
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
justinvyu
left a comment
There was a problem hiding this comment.
Nice! Before merging, do you have any other ideas for the enun namings? Want to make sure we're confident before releasing the public API.
I was thinking PENDING instead of LIVE, and COMMITTED instead of UPLOADED. WDYT?
Yeah I think COMMITTED is more accurate than UPLOADED. |
|
Do we actually need to introduce a LIVE mode? Wondering if we can just do COMMITTED and VALIDATED. COMMITTED was the previous behavior, and LIVE is a new mode that we didn't support before. |
Signed-off-by: Timothy Seah <tseah@anyscale.com>
…#58271) `get_all_reported_checkponts` can have 2 different levels: 1) Return when all reported checkpoints have been assembled. Those were the intended semantics before this PR, though there was a bug with get_all_reported_checkpoints + async checkpointing in which we might not wait for the most recently reported checkpoint to be assembled, which this PR also fixes. This could be useful if users want to end training after they have their desired checkpoint. 2) Return when all reported checkpoints have been validated. This is useful for the original purpose of `get_all_reported_checkpoints`, which was to wait until every single checkpoint has been reported/validated before saving them to experiment tracking from the workers themselves (not the driver). This PR toggles between these semantics with the new `CheckpointConsistencyMode` enum. --------- Signed-off-by: Timothy Seah <tseah@anyscale.com>
…#58271) `get_all_reported_checkponts` can have 2 different levels: 1) Return when all reported checkpoints have been assembled. Those were the intended semantics before this PR, though there was a bug with get_all_reported_checkpoints + async checkpointing in which we might not wait for the most recently reported checkpoint to be assembled, which this PR also fixes. This could be useful if users want to end training after they have their desired checkpoint. 2) Return when all reported checkpoints have been validated. This is useful for the original purpose of `get_all_reported_checkpoints`, which was to wait until every single checkpoint has been reported/validated before saving them to experiment tracking from the workers themselves (not the driver). This PR toggles between these semantics with the new `CheckpointConsistencyMode` enum. --------- Signed-off-by: Timothy Seah <tseah@anyscale.com> Signed-off-by: YK <1811651+ykdojo@users.noreply.github.com>
…#58271) `get_all_reported_checkponts` can have 2 different levels: 1) Return when all reported checkpoints have been assembled. Those were the intended semantics before this PR, though there was a bug with get_all_reported_checkpoints + async checkpointing in which we might not wait for the most recently reported checkpoint to be assembled, which this PR also fixes. This could be useful if users want to end training after they have their desired checkpoint. 2) Return when all reported checkpoints have been validated. This is useful for the original purpose of `get_all_reported_checkpoints`, which was to wait until every single checkpoint has been reported/validated before saving them to experiment tracking from the workers themselves (not the driver). This PR toggles between these semantics with the new `CheckpointConsistencyMode` enum. --------- Signed-off-by: Timothy Seah <tseah@anyscale.com>
…#58271) `get_all_reported_checkponts` can have 2 different levels: 1) Return when all reported checkpoints have been assembled. Those were the intended semantics before this PR, though there was a bug with get_all_reported_checkpoints + async checkpointing in which we might not wait for the most recently reported checkpoint to be assembled, which this PR also fixes. This could be useful if users want to end training after they have their desired checkpoint. 2) Return when all reported checkpoints have been validated. This is useful for the original purpose of `get_all_reported_checkpoints`, which was to wait until every single checkpoint has been reported/validated before saving them to experiment tracking from the workers themselves (not the driver). This PR toggles between these semantics with the new `CheckpointConsistencyMode` enum. --------- Signed-off-by: Timothy Seah <tseah@anyscale.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
Summary
get_all_reported_checkpontscan have 2 different levels:the intended semantics before this PR, though there was a bug with
get_all_reported_checkpoints + async checkpointing in which we might not
wait for the most recently reported checkpoint to be assembled, which
this PR also fixes. This could be useful if users want to end training
after they have their desired checkpoint.
useful for the original purpose of
get_all_reported_checkpoints, whichwas to wait until every single checkpoint has been reported/validated
before saving them to experiment tracking from the workers themselves
(not the driver).
This PR toggles between these semantics with the new
CheckpointConsistencyModeenum.Testing
Unit tests