-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Description
Problem
While implementing the saving of prepared statements in our storage based on the DataFusion, we encountered the following issue:
It is inefficient to save the logical plan and rebuild the physical plan from it at execution time.
The physical planner and optimizer are quite heavy, and for large queries with many columns, building the physical plan can take 100-200 ms.
After reading the optimizer's code, I realized that it includes a lot of complex logic, which is generally quite difficult to optimize. Moreover, the time required for plan construction is an uncontrollable variable, as the optimizations can be arbitrarily complex in order to produce a good plan.
The current API does not allow reusing physical plans for multiple executions due to the following reasons:
-
Metrics. Since metrics are stored within the plans, streams are forced to share a pointer to the same metrics during execution. If physical plans are reused, streams will compete for writing to the metrics, which leads to non-scalability.
-
Lack of physical placeholders. This makes it impossible to substitute parameters during the execute(...) stage.
Possible solution
To address these issues, I created experimental patches that:
-
Move metric storage out of Execution Plans into
TaskContext. The set of metrics is associated with a node of the ExecutionPlan, for example, metrics can be associated withProjectionExec. -
Introduce physical placeholders implemented as a
PhysicalExprtrait. These placeholders are resolved at the execute(...) stage, fetching parameters from theTaskContext.
The experimental patches can be found here:
https://github.com/tarantool/datafusion/commits/askalt/physical-placehdolders/
Challenges related to physical placeholders
For example, since the values of placeholders are not known during the optimizer passes, certain optimizations that depend on these values cannot be performed. In some cases, rebuilding the plan based on the parameters could be beneficial.
These issues can either be addressed by the user or resolved directly in DataFusion at a later stage.
In total
I’m creating this issue to discuss:
- How do you view the problems mentioned above?
- Could the experimental patches be useful for upstream (I’m ready to refine them if needed)?