
[train][v2] Add CheckpointUploadConfig(max_async_upload_threads) via RunConfig.checkpoint_config#55859

Open
kushalthaman wants to merge 3 commits into ray-project:master from kushalthaman:v2-upload-threads

Conversation

@kushalthaman

  • Adds CheckpointUploadConfig(max_async_upload_threads: Optional[int]) as a subclass of CheckpointConfig, exposed via RunConfig.checkpoint_config
  • Validates that the value is None or an int >= 1
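The shape described above can be sketched as a minimal standalone dataclass. Note this is an illustration, not the actual Ray Train code: the `CheckpointConfig` below is a simplified stand-in for the real class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckpointConfig:
    """Simplified stand-in for Ray Train's CheckpointConfig."""
    num_to_keep: Optional[int] = None


@dataclass
class CheckpointUploadConfig(CheckpointConfig):
    """Adds an async-upload-thread knob on top of the base checkpoint config."""
    max_async_upload_threads: Optional[int] = None

    def __post_init__(self):
        v = self.max_async_upload_threads
        # Explicit int check, in line with the review suggestion below: the
        # error message promises an integer, so reject floats/strings too.
        if v is not None and (not isinstance(v, int) or v < 1):
            raise ValueError(
                "max_async_upload_threads must be None or an integer >= 1"
            )
```

Because it subclasses `CheckpointConfig`, the new config can be passed anywhere the base config is accepted.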

…RunConfig.checkpoint_config

Signed-off-by: Kushal Thaman <kushalt@Kushals-MacBook-Pro.local>
@kushalthaman kushalthaman requested a review from a team as a code owner August 23, 2025 05:33
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces CheckpointUploadConfig to allow configuring max_async_upload_threads for checkpointing. The change is well-contained and follows existing patterns. I've suggested a small improvement to the validation logic to make it more robust by explicitly checking for an integer type, which aligns with the error message and prevents potential issues with non-integer inputs.

kushalthaman and others added 2 commits August 22, 2025 22:41
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Kushal Thaman <62183606+kushalthaman@users.noreply.github.com>
… and per-attempt timeout

- Wire retry/timeout around persist_current_checkpoint in TrainContext._save_checkpoint
- Structured per-attempt logs (start/success/retry/timeout/exhausted)
- Expose knobs in CheckpointUploadConfig: upload_retry_attempts, initial_delay, backoff_factor, attempt_timeout
- No behavior change if not configured (defaults safe)

Signed-off-by: Kushal Thaman <kushalt@Kushals-MacBook-Pro.local>
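The retry wiring described in the commit message can be sketched as a standalone helper. The parameter names mirror the knobs listed above; the function itself is illustrative, not the actual Ray implementation.

```python
import logging
import time

logger = logging.getLogger(__name__)


def persist_with_retries(
    persist_fn,
    upload_retry_attempts: int = 3,
    upload_retry_initial_delay_s: float = 1.0,
    upload_retry_backoff_factor: float = 2.0,
):
    """Call persist_fn, retrying with exponential backoff on failure.

    With the defaults, failures are retried after 1s, 2s, then 4s before
    the last exception is re-raised.
    """
    delay_s = upload_retry_initial_delay_s
    for attempt in range(upload_retry_attempts + 1):
        try:
            return persist_fn()
        except Exception as exc:
            if attempt == upload_retry_attempts:
                logger.warning(
                    "Upload retries exhausted after %d attempts", attempt + 1
                )
                raise
            logger.debug(
                "Upload attempt %d failed (%s); retrying in %.1fs",
                attempt + 1, exc, delay_s,
            )
            time.sleep(delay_s)
            delay_s *= upload_retry_backoff_factor
```

With `upload_retry_attempts=0` the function makes exactly one attempt, which matches the "no behavior change if not configured" default mentioned above.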
@kushalthaman kushalthaman requested a review from a team as a code owner August 23, 2025 06:16
@ray-gardener ray-gardener bot added train Ray Train Related Issue community-contribution Contributed by the community labels Aug 23, 2025
Contributor

@TimothySeah TimothySeah left a comment


Thanks for doing this! @justinvyu @matthewdeng fyi.


@dataclass
@PublicAPI(stability="alpha")
class CheckpointUploadConfig(CheckpointConfig):
Contributor


Can you put this in https://github.com/ray-project/ray/blob/master/python/ray/train/v2/api/config.py instead? We are moving away from the air folder.

)
if self.upload_retry_attempts < 0:
raise ValueError("upload_retry_attempts must be >= 0")
if self.upload_retry_initial_delay_s <= 0:
Contributor


Can delay be 0 i.e. retry immediately?

if self.upload_retry_initial_delay_s <= 0:
raise ValueError("upload_retry_initial_delay_s must be > 0")
if self.upload_retry_backoff_factor <= 0:
raise ValueError("upload_retry_backoff_factor must be > 0")
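The `backoff_factor > 0` check exists because each retry delay is the previous delay multiplied by the factor. A quick sketch of the resulting schedule (the helper name is illustrative):

```python
def backoff_delays(initial_delay_s: float, backoff_factor: float, attempts: int):
    """Yield the delay used before each retry attempt."""
    delay = initial_delay_s
    for _ in range(attempts):
        yield delay
        delay *= backoff_factor
```

For example, `list(backoff_delays(1.0, 2.0, 4))` gives `[1.0, 2.0, 4.0, 8.0]`; a factor below 1 would shrink the delays instead of growing them, and a factor of 0 would make every delay after the first zero.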
Contributor


delay_s = max(0.0, float(initial_delay_s))
last_exc: Optional[Exception] = None

while True:
Contributor


Would suggest using a retry library instead. @khluu I see you've worked on ray's pip dependencies - what library would you suggest for retries in Python?

Contributor


Actually @kushalthaman can you use https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/util.py#L1417 instead? We want to minimize the number of pip dependencies we add.
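A generic helper of the shape the reviewer points at looks roughly like the sketch below. This is not a reproduction of Ray's internal utility (its exact signature may differ); the `description` parameter is included because such helpers typically take a human-readable label for log messages.

```python
import random
import time


def call_with_retry(
    f,
    description: str,
    max_attempts: int = 10,
    max_backoff_s: float = 32.0,
):
    """Retry f() with capped exponential backoff and jitter (sketch only).

    `description` is the kind of label real helpers use when logging
    failed attempts; it is unused in this minimal version.
    """
    for attempt in range(max_attempts):
        try:
            return f()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Cap the exponential backoff and add jitter so many workers
            # retrying at once do not hammer storage in lockstep.
            backoff = min(2.0 ** attempt, max_backoff_s) * random.uniform(0.5, 1.0)
            time.sleep(backoff)
```

Reusing a shared helper like this keeps retry semantics consistent across the codebase and, as the reviewer notes, avoids adding a new pip dependency such as a third-party retry library.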

persisted_checkpoint = self.storage_context.persist_current_checkpoint(
checkpoint, checkpoint_dir_name
# Extract retry/timeout config if available on the RunConfig.
run_cfg = self.train_run_context.get_run_config()
Contributor


If we put CheckpointConfig in ray train v2, CheckpointConfig will always have these attributes.


while True:
start_ts = time.monotonic()
logger.info(
Contributor


this is a bit verbose - maybe logger.debug is better

)
try:
# Persist the checkpoint to the remote storage path.
persisted_checkpoint = self.storage_context.persist_current_checkpoint(
Contributor


persist_current_checkpoint runs in the same thread - we can either figure out how to make pyarrow.fs.copy_files time out or just not handle timeouts for now
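One way to get a timeout around a blocking call like this is to run it in a worker thread and wait on the future, as sketched below (an illustration of the reviewer's point, not a fix they proposed). Note the caveat in the comments: the copy is abandoned on timeout, not cancelled.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeoutError


def persist_with_timeout(persist_fn, timeout_s: float):
    """Run a blocking upload in a worker thread so the caller can time out.

    Caveat: if the timeout fires, the underlying copy keeps running in the
    worker thread; it is abandoned, not cancelled, which is exactly why
    timing out pyarrow.fs.copy_files itself is the harder problem.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(persist_fn)
    try:
        # Raises concurrent.futures.TimeoutError if the upload is too slow.
        return future.result(timeout=timeout_s)
    finally:
        # Do not block waiting for an abandoned upload to finish.
        pool.shutdown(wait=False)
```

Given that limitation, simply not handling timeouts for now, as suggested, is a reasonable first cut.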

@github-actions

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Sep 10, 2025
@kushalthaman

This comment was marked as outdated.

@github-actions github-actions bot added unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it. and removed stale The issue is stale. It will be closed within 7 days unless there are further conversation labels Sep 12, 2025
