
[Data] Add Checkpoint/Resume Support for Ray Data Pipelines #55008

@dragongu

Description


Summary

Ray Data currently lacks built-in checkpointing functionality, which makes it challenging to recover from failures in long-running data processing pipelines. This feature request proposes adding checkpoint and resume capabilities to Ray Data to improve fault tolerance and reduce the cost of restarting large-scale data processing jobs.

Motivation

Large Ray Data pipelines can take hours or days to complete. When a failure occurs, whether from an exception not configured as retryable or from a bug, the entire pipeline must restart from the beginning, resulting in:

  1. High Costs
    • Significant GPU resource waste
    • Extended time-to-completion
  2. Operational Complexity
    • Users currently need to manually segment large jobs (e.g., splitting a single large job into 10 parts)
    • No built-in mechanism to preserve progress when jobs are interrupted
    • Cross-cluster job migration is not supported: jobs sometimes must be migrated to other clusters/data centers when high-priority workloads urgently require the resources

Requirements Overview

  1. Job State Persistence to External Storage
  2. Cross-Cluster Resume
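Today users approximate these two requirements by hand. A minimal sketch of the pattern, with all names hypothetical and a local directory standing in for external storage (the real feature would integrate with Ray Data's execution plan rather than tracking partitions manually):

```python
import json
import os

def process_partition(part_id):
    # Stand-in for expensive per-partition work (e.g., a GPU map_batches stage).
    return sum(range(part_id * 10, part_id * 10 + 10))

def run_with_checkpoints(partition_ids, ckpt_dir):
    """Process partitions, persisting progress to a JSON manifest so a
    restarted job (possibly on another cluster sharing the same storage)
    can skip already-completed work."""
    manifest_path = os.path.join(ckpt_dir, "manifest.json")
    done = {}
    if os.path.exists(manifest_path):
        with open(manifest_path) as f:
            done = json.load(f)
    for pid in partition_ids:
        key = str(pid)
        if key in done:
            continue  # resume: this partition finished in a previous run
        done[key] = process_partition(pid)
        # Write the manifest atomically so a crash mid-write never
        # leaves a corrupt checkpoint behind.
        tmp = manifest_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(done, f)
        os.replace(tmp, manifest_path)
    return done
```

Running the function a second time against the same `ckpt_dir` skips every completed partition, which is the resume behavior the request asks Ray Data to provide natively instead of via per-user bookkeeping.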

Use case

No response

Metadata


Labels

P1 (Issue that should be fixed within a few weeks), community-backlog, data (Ray Data-related issues), enhancement (Request for new feature and/or capability), stability
