[train][v2] Make async checkpoint upload concurrency configurable (max_async_upload_threads)

### Description

Add a user-configurable knob to control the maximum number of concurrent async checkpoint uploads per worker in Train v2, replacing the current hard-coded constant.

- Large checkpoints or slow object stores can cause memory pressure if too many uploads run in parallel.
- Users need to tune concurrency based on storage bandwidth, checkpoint size, and cluster resources.
- A documented, per-run setting gives users clearer control than a fixed internal constant.

- Introduce `CheckpointUploadConfig` (subclass of `CheckpointConfig`) with:
  - `max_async_upload_threads: Optional[int] = None` (None = use current internal default)


- Also add retry and timeout behavior to Train v2 checkpoint uploads.
- Wrap `persist_current_checkpoint(...)` in `TrainContext._save_checkpoint` with exponential-backoff retries and a soft per-attempt timeout, and emit structured per-attempt logs for observability.
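The retry behavior described above could look roughly like the following sketch. It is not the code from the linked PR: `run_with_retries`, its parameters, and the log format are all hypothetical, and `fn` stands in for a call such as `persist_current_checkpoint(...)`. The timeout is "soft" in the sense that the caller stops waiting for an attempt but does not forcibly cancel it.

```python
import logging
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

logger = logging.getLogger("checkpoint_upload")


def run_with_retries(fn, *, max_attempts=3, base_delay_s=1.0, backoff=2.0,
                     attempt_timeout_s=None, sleep=time.sleep):
    """Call fn() with exponential backoff and a soft per-attempt timeout."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        # One throwaway thread per attempt so a slow attempt can be abandoned.
        executor = ThreadPoolExecutor(max_workers=1)
        start = time.monotonic()
        try:
            result = executor.submit(fn).result(timeout=attempt_timeout_s)
            logger.info("upload attempt=%d status=ok elapsed_s=%.3f",
                        attempt, time.monotonic() - start)
            return result
        except Exception as exc:
            last_error = exc
            status = "timeout" if isinstance(exc, FutureTimeout) else "error"
            logger.warning("upload attempt=%d status=%s elapsed_s=%.3f error=%r",
                           attempt, status, time.monotonic() - start, exc)
        finally:
            # Do not block on an attempt that is still running (soft timeout).
            executor.shutdown(wait=False)
        if attempt < max_attempts:
            # Delays grow as base, base*backoff, base*backoff**2, ...
            sleep(base_delay_s * backoff ** (attempt - 1))
    raise RuntimeError(f"upload failed after {max_attempts} attempts") from last_error
```

One design consequence of the soft timeout is that an abandoned attempt may still complete in the background, so the wrapped upload should be safe to run more than once.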


Implementation PR: [#55859](https://github.com/ray-project/ray/pull/55859)

[train][v2] Make async checkpoint upload concurrency configurable (max_async_upload_threads) #55861
