Skip to content

[Data] Simplify execution callback lifecycle #60279

@bveeramani

Description

@bveeramani

Here's the current lifecycle of callbacks for a single Dataset execution:

  1. During planning, the planner adds the LoadCheckpointCallback to the data context
  2. During execution, the executor calls get_execution_callbacks
    1. On the first call, the function adds ExecutionIdxUpdateCallback, IssueDetectionExecutionCallback, and any callbacks configured in EXECUTION_CALLBACKS_ENV_VAR to the data context, and marks a flag on the data context to indicate that the callbacks have been initialized.
    2. On subsequent calls, the function just gets the callbacks stored in the data context.

This design isn't ideal because:

  1. It leaks callback-related state in the data context (e.g., whether callbacks have been initialized).
  2. Callbacks are constructed in multiple phases of execution (during planning and lazily during execution). This makes it harder to reason about the callback lifecycle.

An alternative design might be:

  1. DataContext is read-only. It stored stores factories for the default user-provided or internal callbacks, but doesn't store state like whether those callbacks have been initialized.
  2. The streaming executors eagerly constructs all of the callbacks once in the constructor.
  3. Rather than conditionally adding LoadCheckpointCallback in the planner, it can be refactored so it no-ops if it's not relevant

Metadata

Metadata

Assignees

Labels

P2Important issue, but not time-criticaldataRay Data-related issues

Type

Projects

Status

In progress

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions