Description
Add a user-configurable knob to control the maximum number of concurrent async checkpoint uploads per worker in Train v2, replacing the current hard-coded constant.
- Large checkpoints or slow object stores can cause memory pressure if too many uploads run in parallel.
- Users need to tune concurrency based on storage bandwidth, checkpoint size, and cluster resources.
- A config option provides a clear, documented, per-run control rather than relying on a fixed constant.
- Introduce `CheckpointUploadConfig` (subclass of `CheckpointConfig`) with:
  - `max_async_upload_threads: Optional[int] = None` (`None` = use current internal default)
- Also adds robust retry and timeout behavior to Train v2 checkpoint uploads.
- Wraps `persist_current_checkpoint(...)` in `TrainContext._save_checkpoint` with exponential backoff retries and a soft per-attempt timeout, and emits structured per-attempt logs for observability.
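
To make the two changes above concrete, here is a minimal, self-contained sketch. It is not the code in this change and does not import Ray: `CheckpointUploadConfig` is re-declared as a plain dataclass, and `DEFAULT_MAX_UPLOAD_THREADS`, `resolved_threads`, `make_upload_pool`, `upload_with_retries`, and all of their parameters are hypothetical names chosen for illustration. The `upload_fn` argument stands in for a call such as `persist_current_checkpoint(...)`.

```python
import logging
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from dataclasses import dataclass
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger("checkpoint_upload_sketch")

# Hypothetical placeholder for the hard-coded constant the proposal replaces.
DEFAULT_MAX_UPLOAD_THREADS = 1


@dataclass
class CheckpointUploadConfig:
    """Illustrative stand-in; the proposed class would subclass CheckpointConfig."""

    max_async_upload_threads: Optional[int] = None  # None = internal default

    def resolved_threads(self) -> int:
        if self.max_async_upload_threads is None:
            return DEFAULT_MAX_UPLOAD_THREADS
        if self.max_async_upload_threads < 1:
            raise ValueError("max_async_upload_threads must be >= 1")
        return self.max_async_upload_threads


def make_upload_pool(config: CheckpointUploadConfig) -> ThreadPoolExecutor:
    # The pool size caps how many uploads run concurrently on this worker.
    return ThreadPoolExecutor(max_workers=config.resolved_threads())


def upload_with_retries(
    upload_fn: Callable[[], T],
    *,
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    attempt_timeout_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    last_error: Optional[BaseException] = None
    for attempt in range(1, max_attempts + 1):
        start = time.monotonic()
        runner = ThreadPoolExecutor(max_workers=1)
        try:
            result = runner.submit(upload_fn).result(timeout=attempt_timeout_s)
            status = "ok"
        except FutureTimeout as exc:
            # Soft timeout: we stop waiting, but the attempt's thread is not killed.
            last_error, status = exc, "timeout"
        except Exception as exc:
            last_error, status = exc, "error"
        finally:
            runner.shutdown(wait=False)
        # One key=value log line per attempt, so attempts can be filtered later.
        logger.info(
            "checkpoint_upload attempt=%d/%d status=%s elapsed_s=%.3f",
            attempt, max_attempts, status, time.monotonic() - start,
        )
        if status == "ok":
            return result
        if attempt < max_attempts:
            # Exponential backoff: the delay doubles after each failed attempt.
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise RuntimeError(f"upload failed after {max_attempts} attempts") from last_error
```

In this sketch a caller would submit `upload_with_retries` to the pool returned by `make_upload_pool(config)`, so the thread limit bounds concurrency while each upload retries independently. The timeout is "soft" in the sense that a timed-out attempt is abandoned rather than interrupted; a real implementation would need to decide whether an abandoned upload may still complete in the background.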