[train][v2] Make async checkpoint upload concurrency configurable (max_async_upload_threads) #55861

@kushalthaman

Description

Add a user-configurable knob to control the maximum number of concurrent async checkpoint uploads per worker in Train v2, replacing the current hard-coded constant.

  • Large checkpoints or slow object stores can cause memory pressure if too many uploads run in parallel.

  • Users need to tune concurrency based on storage bandwidth, checkpoint size, and cluster resources.

  • A configurable setting gives users a clear, documented, per-run control instead of a fixed constant.

  • Introduce CheckpointUploadConfig (subclass of CheckpointConfig) with:

    • max_async_upload_threads: Optional[int] = None (None means the current internal default is used)

  • Add retry and timeout behavior to Train v2 checkpoint uploads.

  • Wrap persist_current_checkpoint(...) in TrainContext._save_checkpoint with exponential-backoff retries and a soft per-attempt timeout, emitting structured per-attempt logs for observability.
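A minimal sketch of how the two proposals above could fit together, written so it runs without Ray installed. Everything here is illustrative rather than the actual implementation: the `CheckpointConfig` class is a stand-in for the real one, and `DEFAULT_MAX_UPLOAD_THREADS`, `resolved_upload_threads`, `make_upload_pool`, and `upload_with_retries` are hypothetical names. The "soft" timeout is assumed to mean that a slow attempt is logged but not interrupted.

```python
import logging
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger("checkpoint_upload_sketch")

# Assumption: placeholder for the internal hard-coded default, whose real
# value is not stated in this issue.
DEFAULT_MAX_UPLOAD_THREADS = 1


@dataclass
class CheckpointConfig:
    """Stand-in for the real CheckpointConfig so the sketch is self-contained."""
    num_to_keep: Optional[int] = None


@dataclass
class CheckpointUploadConfig(CheckpointConfig):
    # None means "fall back to the internal default".
    max_async_upload_threads: Optional[int] = None

    def __post_init__(self) -> None:
        n = self.max_async_upload_threads
        if n is not None and n < 1:
            raise ValueError("max_async_upload_threads must be >= 1 or None")

    def resolved_upload_threads(self) -> int:
        """Return the user's value, or the default when unset (hypothetical helper)."""
        if self.max_async_upload_threads is None:
            return DEFAULT_MAX_UPLOAD_THREADS
        return self.max_async_upload_threads


def make_upload_pool(config: CheckpointUploadConfig) -> ThreadPoolExecutor:
    """Bound per-worker upload concurrency with a fixed-size thread pool."""
    return ThreadPoolExecutor(
        max_workers=config.resolved_upload_threads(),
        thread_name_prefix="ckpt-upload",
    )


def upload_with_retries(
    upload_fn: Callable[[], T],
    *,
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    soft_timeout_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call upload_fn, retrying on failure with exponential backoff.

    Logs one key=value line per attempt. An attempt that runs longer than
    soft_timeout_s is only flagged in the log; it is not cancelled.
    """
    for attempt in range(1, max_attempts + 1):
        start = time.monotonic()
        try:
            result = upload_fn()
        except Exception as exc:
            elapsed = time.monotonic() - start
            logger.warning(
                "attempt=%d status=error elapsed_s=%.3f error=%r",
                attempt, elapsed, exc,
            )
            if attempt == max_attempts:
                raise
            # Delay doubles each time: base, 2*base, 4*base, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
        else:
            elapsed = time.monotonic() - start
            logger.info(
                "attempt=%d status=ok elapsed_s=%.3f soft_timeout_exceeded=%s",
                attempt, elapsed, elapsed > soft_timeout_s,
            )
            return result
    raise ValueError("max_attempts must be >= 1")
```

In this sketch a caller would submit `upload_with_retries` to the pool returned by `make_upload_pool`, so the thread count caps how many checkpoints are held for upload at once while each upload retries independently.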

Implementation PR

Metadata

Labels

community-backlog, enhancement, performance, stability, train, triage
