[Data] Simplify execution callback lifecycle

Here's the current lifecycle of callbacks for a single `Dataset` execution:
1. During planning, the planner adds the `LoadCheckpointCallback` to the data context
2. During execution, the executor calls `get_execution_callbacks`
    1.  On the first call, the function adds `ExecutionIdxUpdateCallback`, `IssueDetectionExecutionCallback`, and any callbacks configured in `EXECUTION_CALLBACKS_ENV_VAR` to the data context, and marks a flag on the data context to indicate that the callbacks have been initialized.
    2. On subsequent calls,  the function just gets the callbacks stored in the data context.

This design isn't ideal because:
1. It leaks callback-related state in the data context (e.g., whether callbacks have been initialized). 
2. Callbacks are constructed in multiple phases of execution (during planning and lazily during execution). This makes it harder to reason about the callback lifecycle.

An alternative design might be:
1. DataContext is read-only. It stored stores factories for the default user-provided or internal callbacks, but doesn't store state like whether those callbacks have been initialized.
2. The streaming executors eagerly constructs all of the callbacks once in the constructor.
3. Rather than conditionally adding `LoadCheckpointCallback` in the planner, it can be refactored so it no-ops if it's not relevant  

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Data] Simplify execution callback lifecycle #60279

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[Data] Simplify execution callback lifecycle #60279

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions