-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Description
Is your feature request related to a problem or challenge?
(note this is mostly from @peter-toth 's description on #8891 (comment))
As @peter-toth noted, the current TreeNode APIs are somewhat inconsistent which has the following challenges:
- adds a barrier to entry for newcomers to the DataFusion codebase
- The APIs sometimes can not (or do no) implement pruning / early tree traversal termination and therefore do more tree walking / transforming than necessary
- Increases code maintenance costs
Specific Challenges of the Current APIs
Some specific challenges with the current APIs. These can cause a newcommer (like me) some confusion:
-
The
apply/visitfunctions are capable to prune (skip) subtrees or return immediately (stop) but thetransformfunctions are not. To make it more confusionrewriteis also capable to to prune, but instead of using a common way to control recursion,visitandrewriteuse different enums with conflicting semantics.
See this PR (ConsolidateTreeNodetransform and rewrite APIs #8891) for details onVisitRecursionvsRewriteRecursion. -
There is this
Transformedenum whose purpose is to explicitely define if any change was made to a node. This enum is used intransformclousres but for some reason it is missing fromrewrite. Moreover this explicit information is neither propogatged up to API callee nor it is used for detecting changes in nodes.
See details in RemoveTransformedenum #8835. I believe the current state simply doesn't make sense and just causes confusion. -
rewritealso seems to be inconsistent withtransform_upandtransform_down.rewriteprobably organically ended up in its current state and capable to do its current job, but what I would expect from such a function is that it simply incorporates 2 closures fromtransform_upandtransform_downto define:- what transformation should be done before (pre-order) and after (post-order) transformation of the node's children
- and how the recursion should continue .
See this PR (Consolidate
TreeNodetransform and rewrite APIs #8891) for the details. IMO the currentTreeNodeRewriter.pre_visit()seems like a mess and its logic is hard to grasp.
Specific missing APIs
- Pruning capability of
transformfunctions:
Pruning would be very important as many of the transformations wouldn't require traversing on the whole tree and could improve performance a lot. - Payload propagation:
I think the the reason why there are so many (4 base + 9 derived) tree implementations (with many code duplications) in DataFusion is that there is no API to describe a transformation while also propagating state. In Transform with payload #8664 with the help oftransform_with_payloadI not just eleminated 7 derived trees but also improved some of the algorightms to require only one traversal (also a perf improvement).
Describe the solution you'd like
The proposal is a general API like this:
/// Transforms the tree using `f_down` and `f_up` closures. `f_down` is applied on a
/// node while traversing the tree top-down (pre-order, before the node's children are
/// visited) while `f_up` is applied on a node while traversing the tree bottom-up
/// (post-order, after the the node's children are visited).
///
/// The `f_down` closure takes
/// - a `PD` type payload from its parent
/// and returns a tuple made of:
/// - a possibly modified node,
/// - a `PC` type payload to pass to `f_up`,
/// - a `Vec<FD>` type payload to propagate down to its children
/// (one `FD` element is propagated down to each child),
/// - a `TreeNodeRecursion` enum element to control recursion flow.
///
/// The `f_up` closure takes
/// - a `PC` type payload from `f_down` and
/// - a `Vec<PU>` type payload collected from its children
/// and returns a tuple made of:
/// - a possibly modified node,
/// - a `FU` type payload to propagate up to its parent,
/// - a `TreeNodeRecursion` enum element to control recursion flow.
fn transform_with_payload<FD, PD, PC, FU, PU>(
self,
f_down: &mut FD,
payload_down: PD,
f_up: &mut FU,
) -> Result<(Self, PU)>
where
FD: FnMut(Self, PD) -> Result<(Self, Vec<PD>, PC, TreeNodeRecursion)>,
FU: FnMut(Self, PC, Vec<PU>) -> Result<(Self, PU, TreeNodeRecursion)>,Obviously the closure return types can be extracted to a type alias or strucs like in the case of f_down could be:
pub struct TransformDownResult<N, PD, PC> {
pub transformed_node: N,
pub payload_to_children: Vec<PD>,
pub payload_to_f_up: PC,
pub recursion_control: TreeNodeRecursion,
}All the other functions can then be turned into a specialization of the above:
apply(visit_down): Onlyf_downclosure that doesn't return a modified node and payload (only retrurnsTreeNodeRecursion)visit/TreeNodeVisitor: Bothf_downandf_upclosures needed. The closures can be incorporated into aTreeNodeVisitorobject but they don't return a modified node and payloadtransform_down: Onlyf_downclosure that doesn't return payloadtransform_up: Onlyf_upclosure that doesn't return payloadtransform: Bothf_downandf_upclosures needed, but they don't return payloadsrewrite/TreeNodeRewriter: Bothf_downandf_upare incorporated into aTreeNodeRewriterobject, but they don't return a payloadtransform_down_with_payload: Onlyf_downclosuretransform_up_with_payload: Onlyf_upclosure
Describe alternatives you've considered
As noted above, there are a variety of PRs that test out various ideas in one way or another and the major challenge I think is figuring out what changes to make, in what order, and how much code churn to apply)
Additional context
Related tickets / PRs:
- Consolidate
TreeNodetransform and rewrite APIs #8891 - TreeNode refactor code deduplication: Part 3 #8817
- Transform with payload #8664
- Remove
Transformedenum #8835 - Simplify Expr::map_children #9876
- Consistent LogicalPlan subquery handling in TreeNode::apply and TreeNode::visit #9913
- Simplify TreeNode recursions #9965
- Consolidate
LogicalPlantree node walking / rewriting code in one place #9994 - Deprecate TreeNode
transform_xx_mutmethods #10097