[train][v2] Add CheckpointUploadConfig(max_async_upload_threads) via RunConfig.checkpoint_config #55859
kushalthaman wants to merge 3 commits into ray-project:master
Conversation

kushalthaman commented on Aug 23, 2025
- Adds CheckpointUploadConfig(max_async_upload_threads: Optional[int]) as a subclass of CheckpointConfig, wired up via RunConfig.checkpoint_config.
- Validates that the value is None or an integer >= 1.
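Concretely, the validation described above could look roughly like this. This is an illustrative sketch, not the PR's actual code; the simplified `CheckpointConfig` body and the exact error message are assumptions:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckpointConfig:
    """Stand-in for Ray Train's CheckpointConfig (simplified for illustration)."""

    num_to_keep: Optional[int] = None


@dataclass
class CheckpointUploadConfig(CheckpointConfig):
    """CheckpointConfig plus a cap on async checkpoint-upload threads."""

    max_async_upload_threads: Optional[int] = None

    def __post_init__(self):
        v = self.max_async_upload_threads
        # Explicit integer type check so non-int inputs (e.g. 1.5) fail
        # loudly, matching the error message.
        if v is not None and (not isinstance(v, int) or v < 1):
            raise ValueError(
                f"max_async_upload_threads must be None or an integer >= 1, got {v!r}"
            )
```

`None` keeps the default behavior (no explicit thread cap), so existing users are unaffected unless they opt in.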
Code Review
This pull request introduces CheckpointUploadConfig to allow configuring max_async_upload_threads for checkpointing. The change is well-contained and follows existing patterns. I've suggested a small improvement to the validation logic to make it more robust by explicitly checking for an integer type, which aligns with the error message and prevents potential issues with non-integer inputs.
… and per-attempt timeout

- Wire retry/timeout around persist_current_checkpoint in TrainContext._save_checkpoint
- Structured per-attempt logs (start/success/retry/timeout/exhausted)
- Expose knobs in CheckpointUploadConfig: upload_retry_attempts, initial_delay, backoff_factor, attempt_timeout
- No behavior change if not configured (defaults safe)
TimothySeah
left a comment
Thanks for doing this! @justinvyu @matthewdeng fyi.
    @dataclass
    @PublicAPI(stability="alpha")
    class CheckpointUploadConfig(CheckpointConfig):
Can you put this in https://github.com/ray-project/ray/blob/master/python/ray/train/v2/api/config.py instead? We are moving away from the air folder.
    )
    if self.upload_retry_attempts < 0:
        raise ValueError("upload_retry_attempts must be >= 0")
    if self.upload_retry_initial_delay_s <= 0:
Can delay be 0 i.e. retry immediately?
    if self.upload_retry_initial_delay_s <= 0:
        raise ValueError("upload_retry_initial_delay_s must be > 0")
    if self.upload_retry_backoff_factor <= 0:
        raise ValueError("upload_retry_backoff_factor must be > 0")
Should this be > 1, since it's effectively always at least 1 anyway? https://github.com/ray-project/ray/pull/55859/files#diff-1c0e311e01c25e2cba8758268cc999b6029b1e03040821f6bb6b1cf6c1debcc6R289
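The reviewer's point can be seen by computing the per-attempt delays. A tiny illustrative helper (not part of the PR) showing that a backoff factor of exactly 1.0 degenerates to constant delay, while values in (0, 1) would shrink the delay:

```python
def retry_delays(initial_delay_s: float, backoff_factor: float, attempts: int):
    """Sleep time before each retry under exponential backoff (illustration)."""
    return [initial_delay_s * backoff_factor**i for i in range(attempts)]


# factor == 1.0: constant backoff, the delay never grows
# retry_delays(2.0, 1.0, 3) -> [2.0, 2.0, 2.0]
# factor == 2.0: delay doubles on every attempt
# retry_delays(2.0, 2.0, 3) -> [2.0, 4.0, 8.0]
```

So a `> 0` check permits factors in (0, 1], which give non-growing delays; requiring `> 1` (or `>= 1`) would encode the intent that backoff never shrinks.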
    delay_s = max(0.0, float(initial_delay_s))
    last_exc: Optional[Exception] = None

    while True:
Would suggest using a retry library instead. @khluu I see you've worked on Ray's pip dependencies - what library would you suggest for retries in Python?
Actually @kushalthaman can you use https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/util.py#L1417 instead? We want to minimize the number of pip dependencies we add.
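For reference, a shared retry utility of the kind being suggested typically has roughly this shape. This is a generic sketch, not the `ray.data._internal.util` helper itself; its name, parameters, and defaults here are assumptions:

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int,
    initial_delay_s: float = 1.0,
    backoff_factor: float = 2.0,
    retryable: Tuple[Type[Exception], ...] = (Exception,),
) -> T:
    """Call fn(), retrying with exponential backoff on retryable exceptions.

    Raises the last exception once max_attempts is exhausted.
    """
    delay_s = initial_delay_s
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts:
                raise
            # Back off before the next attempt.
            time.sleep(delay_s)
            delay_s *= backoff_factor
    raise AssertionError("unreachable")
```

Reusing one shared helper like this keeps the retry policy in a single place and, as the comment notes, avoids adding a new pip dependency such as a third-party retry library.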
    persisted_checkpoint = self.storage_context.persist_current_checkpoint(
        checkpoint, checkpoint_dir_name

    # Extract retry/timeout config if available on the RunConfig.
    run_cfg = self.train_run_context.get_run_config()
If we put CheckpointConfig in ray train v2, CheckpointConfig will always have these attributes.
    while True:
        start_ts = time.monotonic()
        logger.info(
This is a bit verbose - maybe logger.debug is better.
    )
    try:
        # Persist the checkpoint to the remote storage path.
        persisted_checkpoint = self.storage_context.persist_current_checkpoint(
persist_current_checkpoint runs in the same thread - we can either figure out how to make pyarrow.fs.copy_files time out or just not handle timeouts for now
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public Slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.