Prepared physical plan reuse #14342

@askalt

Problem

While implementing saved prepared statements in our DataFusion-based storage, we ran into the following issue:

It is inefficient to save the logical plan and rebuild the physical plan from it at execution time.
The physical planner and optimizer are quite heavy, and for large queries with many columns, building the physical plan can take 100-200 ms.

After reading the optimizer's code, I realized that it includes a lot of complex logic, which is generally quite difficult to optimize. Moreover, the time required for plan construction is an uncontrollable variable, as the optimizations can be arbitrarily complex in order to produce a good plan.

The current API does not allow a physical plan to be reused across multiple executions, for the following reasons:

  1. Metrics. Since metrics are stored within the plans, all streams created from a plan share a pointer to the same metrics during execution. If a physical plan is reused, concurrent streams contend on writes to those shared metrics, which limits scalability.

  2. Lack of physical placeholders. This makes it impossible to substitute parameters during the execute(...) stage.
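
To make the first point concrete, here is a minimal sketch in plain Rust (standard library only). `PlanNode` and `run_two_executions` are hypothetical stand-ins for illustration, not DataFusion types; the sketch only assumes that a plan node owns its metrics behind a shared pointer, as described above.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Toy stand-in for a plan node that owns its metrics
/// (hypothetical type, not DataFusion's `ExecutionPlan`).
struct PlanNode {
    output_rows: Arc<AtomicUsize>,
}

impl PlanNode {
    /// Every execution receives a handle to the same counter.
    fn execute(&self, rows: usize) -> thread::JoinHandle<()> {
        let counter = Arc::clone(&self.output_rows);
        thread::spawn(move || {
            for _ in 0..rows {
                // All concurrent executions write to this one location.
                counter.fetch_add(1, Ordering::Relaxed);
            }
        })
    }
}

/// Runs two executions of one cached plan and returns the recorded row count.
fn run_two_executions() -> usize {
    let plan = PlanNode {
        output_rows: Arc::new(AtomicUsize::new(0)),
    };
    let first = plan.execute(100);
    let second = plan.execute(250);
    first.join().unwrap();
    second.join().unwrap();
    plan.output_rows.load(Ordering::Relaxed)
}
```

In this toy model both executions update the same counter, so they contend on one memory location, and the value read back is the sum over both executions rather than a per-execution figure.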

Possible solution

To address these issues, I created experimental patches that:

  1. Move metric storage out of the execution plans and into TaskContext. Each set of metrics is associated with a node of the ExecutionPlan; for example, one set can be associated with a ProjectionExec.

  2. Introduce physical placeholders that implement the PhysicalExpr trait. These placeholders are resolved at the execute(...) stage by fetching parameters from the TaskContext.
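
The two ideas above can be sketched together in a simplified model. This is not the code from the experimental patches and does not use DataFusion's API: `ExecContext`, `Expr`, `FilterNode`, and `run_with` are hypothetical names, with `ExecContext` playing the role that TaskContext plays in the proposal.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical per-execution context: bound parameters plus metrics
/// keyed by plan node id. One instance is created per execution.
struct ExecContext {
    params: Vec<i64>,
    output_rows: Mutex<HashMap<usize, usize>>,
}

impl ExecContext {
    fn new(params: Vec<i64>) -> Self {
        ExecContext { params, output_rows: Mutex::new(HashMap::new()) }
    }

    fn add_output_rows(&self, node_id: usize, n: usize) {
        *self.output_rows.lock().unwrap().entry(node_id).or_insert(0) += n;
    }

    fn output_rows(&self, node_id: usize) -> usize {
        self.output_rows.lock().unwrap().get(&node_id).copied().unwrap_or(0)
    }
}

/// Toy physical expression over a single integer column.
enum Expr {
    Column,
    /// Zero-based index into `ExecContext::params`.
    Placeholder(usize),
    Gt(Box<Expr>, Box<Expr>),
}

impl Expr {
    /// The placeholder is resolved here, at execution time, from the context.
    fn eval(&self, row: i64, ctx: &ExecContext) -> Result<i64, String> {
        match self {
            Expr::Column => Ok(row),
            Expr::Placeholder(i) => ctx
                .params
                .get(*i)
                .copied()
                .ok_or_else(|| format!("missing parameter ${}", i + 1)),
            Expr::Gt(l, r) => Ok((l.eval(row, ctx)? > r.eval(row, ctx)?) as i64),
        }
    }
}

/// Immutable, reusable plan node: it holds no metrics and no parameter values.
struct FilterNode {
    id: usize,
    predicate: Expr,
}

impl FilterNode {
    fn execute(&self, input: &[i64], ctx: &ExecContext) -> Result<Vec<i64>, String> {
        let mut out = Vec::new();
        for &row in input {
            if self.predicate.eval(row, ctx)? != 0 {
                out.push(row);
            }
        }
        // Metrics go to the per-execution context, not to the plan.
        ctx.add_output_rows(self.id, out.len());
        Ok(out)
    }
}

/// Executes the same plan with a fresh context and the given parameters.
fn run_with(plan: &FilterNode, params: Vec<i64>) -> (Vec<i64>, usize) {
    let ctx = ExecContext::new(params);
    let rows = plan.execute(&[1, 5, 10, 20], &ctx).unwrap();
    let recorded = ctx.output_rows(plan.id);
    (rows, recorded)
}
```

With a plan for `col > $1`, calling `run_with` twice with different parameters reuses the same `FilterNode`, and each call sees only its own row count because the metrics live in the context created for that call.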

The experimental patches can be found here:

https://github.com/tarantool/datafusion/commits/askalt/physical-placehdolders/

Challenges related to physical placeholders

Since the values of placeholders are not known during the optimizer passes, optimizations that depend on these values cannot be performed. In some cases, rebuilding the plan based on the actual parameters could be beneficial.
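
As one illustrative case (my assumption, not something taken from the patches), consider statistics-based chunk skipping for a predicate `col > rhs`. The sketch below is a toy model with hypothetical names (`ChunkStats`, `Bound`, `can_skip_chunk`, `chunks_to_scan`); it only shows why an unbound placeholder forces the conservative choice at planning time.

```rust
/// Maximum-value statistic for one data chunk (toy model).
struct ChunkStats {
    max: i64,
}

/// Right-hand side of the predicate `col > rhs` as seen by the planner.
enum Bound {
    Known(i64),
    /// Value is supplied only at execution time.
    Placeholder,
}

/// Returns true when the chunk provably contains no row with `col > rhs`.
fn can_skip_chunk(stats: &ChunkStats, rhs: &Bound) -> bool {
    match rhs {
        Bound::Known(v) => stats.max <= *v,
        // Unknown value at planning time: the chunk must be kept.
        Bound::Placeholder => false,
    }
}

/// Counts the chunks that remain in the plan after skipping.
fn chunks_to_scan(chunks: &[ChunkStats], rhs: &Bound) -> usize {
    chunks.iter().filter(|c| !can_skip_chunk(c, rhs)).count()
}
```

With a known literal the planner can drop chunks up front; with a placeholder it keeps all of them, so the same decision has to be made at execution time or the plan has to be rebuilt once the value is known.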

These issues can either be addressed by the user or resolved directly in DataFusion at a later stage.

Summary

I’m creating this issue to discuss:

  1. How do you view the problems mentioned above?
  2. Could the experimental patches be useful for upstream (I’m ready to refine them if needed)?
