-
Notifications
You must be signed in to change notification settings - Fork 7.4k
Closed
Labels
P2Important issue, but not time-criticalImportant issue, but not time-criticaldataRay Data-related issuesRay Data-related issues
Description
Here's the current lifecycle of callbacks for a single Dataset execution:
- During planning, the planner adds the
LoadCheckpointCallbackto the data context - During execution, the executor calls
get_execution_callbacks- On the first call, the function adds
ExecutionIdxUpdateCallback,IssueDetectionExecutionCallback, and any callbacks configured inEXECUTION_CALLBACKS_ENV_VARto the data context, and marks a flag on the data context to indicate that the callbacks have been initialized. - On subsequent calls, the function just gets the callbacks stored in the data context.
- On the first call, the function adds
This design isn't ideal because:
- It leaks callback-related state in the data context (e.g., whether callbacks have been initialized).
- Callbacks are constructed in multiple phases of execution (during planning and lazily during execution). This makes it harder to reason about the callback lifecycle.
An alternative design might be:
- DataContext is read-only. It stored stores factories for the default user-provided or internal callbacks, but doesn't store state like whether those callbacks have been initialized.
- The streaming executors eagerly constructs all of the callbacks once in the constructor.
- Rather than conditionally adding
LoadCheckpointCallbackin the planner, it can be refactored so it no-ops if it's not relevant
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
P2Important issue, but not time-criticalImportant issue, but not time-criticaldataRay Data-related issuesRay Data-related issues
Type
Projects
Status
In progress