Closed
Labels
P1 (issue that should be fixed within a few weeks) · community-backlog · data (Ray Data-related issues) · enhancement (request for new feature and/or capability) · stability
Description
Summary
Ray Data currently lacks built-in checkpointing functionality, which makes it challenging to recover from failures in long-running data processing pipelines. This feature request proposes adding checkpoint and resume capabilities to Ray Data to improve fault tolerance and reduce the cost of restarting large-scale data processing jobs.
Motivation
Large Ray Data pipelines can take hours or days to complete. When failures occur due to unconfigured retryable exceptions or bugs, the entire pipeline must restart from the beginning, resulting in:
- High costs: significant GPU resource waste and extended time-to-completion
- Operational complexity: users currently need to manually segment large jobs (e.g., splitting a single large job into 10 parts)
- No built-in mechanism to preserve progress when jobs are interrupted
- Cross-cluster job migration is not supported, even though jobs must be migrated to other clusters/data centers when high-priority workloads urgently need those resources
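The manual segmentation workaround mentioned above can be sketched as follows. This is a hypothetical illustration, not Ray Data API: `run_in_parts`, the marker-file scheme, and the `process_part` callback are all names invented here. Each part writes a "done" marker to shared storage, so a restarted job skips parts that already completed.

```python
import json
import tempfile
from pathlib import Path

def run_in_parts(input_files, num_parts, state_dir, process_part):
    """Split a job into num_parts and skip parts already marked done,
    so a restarted job resumes where the previous run stopped."""
    state = Path(state_dir)
    state.mkdir(parents=True, exist_ok=True)
    chunk = -(-len(input_files) // num_parts)  # ceiling division
    for part in range(num_parts):
        marker = state / f"part-{part}.done"
        if marker.exists():  # completed in a prior run: skip
            continue
        files = input_files[part * chunk:(part + 1) * chunk]
        if files:
            process_part(files)  # e.g. a Ray Data pipeline over this slice
        marker.write_text(json.dumps({"files": files}))

# Simulate a run followed by a resume: the second call skips every part.
processed = []
with tempfile.TemporaryDirectory() as d:
    inputs = [f"s3://bucket/in-{i}.parquet" for i in range(25)]
    run_in_parts(inputs, 10, d, processed.append)
    first_run = len(processed)
    run_in_parts(inputs, 10, d, processed.append)
```

The marker files play the role of a crude checkpoint; built-in support would make this bookkeeping unnecessary.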
Requirements Overview
- Job State Persistence to External Storage
- Cross-Cluster Resume
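To make the two requirements concrete, here is a minimal sketch of what persisted job state might look like, assuming a JSON manifest on shared storage that any cluster can reach. The `CheckpointManifest` class and its methods are hypothetical names for illustration, not part of Ray Data.

```python
import json
import tempfile
from pathlib import Path

class CheckpointManifest:
    """Hypothetical manifest on external storage (e.g. S3 or NFS)
    recording which input blocks have been fully processed. A resumed
    job on any cluster that can reach the storage loads the manifest
    and processes only the remaining blocks."""

    def __init__(self, path):
        self.path = Path(path)
        self.done = set()
        if self.path.exists():
            self.done = set(json.loads(self.path.read_text()))

    def mark_done(self, block_id):
        self.done.add(block_id)
        # Write-then-rename so a crash mid-write cannot corrupt the manifest.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(sorted(self.done)))
        tmp.replace(self.path)

    def pending(self, all_blocks):
        return [b for b in all_blocks if b not in self.done]

# Usage sketch, with a local temp file standing in for external storage:
with tempfile.TemporaryDirectory() as d:
    m = CheckpointManifest(Path(d) / "manifest.json")
    m.mark_done("block-0")
    m.mark_done("block-2")
    # A fresh object (e.g. on another cluster) sees the same progress.
    m2 = CheckpointManifest(Path(d) / "manifest.json")
    remaining = m2.pending(["block-0", "block-1", "block-2", "block-3"])
```

A real implementation would also need to persist operator state and output locations, but the resume logic reduces to the same idea: load progress from storage, then schedule only the pending work.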
Use case
No response