-
Notifications
You must be signed in to change notification settings - Fork 4k
Description
Describe the enhancement requested
This is the umbrella issue for supporting special form expression in Arrow compute.
Rationale
It all starts from #41094. The comment #41094 (comment) is the rationale.
To summarize a bit, consider expression if (a != 0) then (b / a) else 0. In current Arrow expression evaluation, it is a nested function call to if_else(ne(a, 0), div(b, a), 0), which eagerly evaluates all its parameters including the div(b, a) even if a can be 0, or more formally, this is a "strict binding" strategy [1]. This is unintuitive and problematic given the meaning of if_else. What we expect here is a non-strict binding strategy, "call-by-name" [2]. Most programming languages have exceptional support for a handful "special" expressions , e.g. Lisp calls those "special forms" [3]. And as a closer peer, Velox uses the same term [4]. (Thanks @felipecrv for all these references!)
It is especially tricky to support such special forms in a vectorized engine like Arrow. There are two main challenges:
- For conditional evaluation like
if cond then if_branch else else_branch, all the arguments are evaluated on a batch basis. However which branch to evaluate (if_branchVS.else_branch) is decided bycondon a row basis. We need a "selection vector" during evaluation to mask which rows in the batch to evaluate and which not - this is about not only the efficiency, but also the "visible side effect" (considerdivide by zerothrown in adivisor != 0branch). - In addition to the selection vector, the kernels must respect it. Currently none does. And it's almost impossible for us to make all existing kernels to respect the selection vector at once (and we probably never will). Thus we need an incremental way to add selection-vector-aware kernels on demand, meanwhile accommodate legacy (selection-vector-non-aware) kernels to be executed "selection-vector-aware"-ly in a general manner - the idea is to first "gather" selected rows from the batch into a new batch, evaluate the expression on the new batch, then "scatter" the result rows into the positions where they belong in the original batch.
Approach
I did try to solve this in #42106 last year, however neither the reviewer nor myself were quite satisfied with the solution back then. Since then I've been reworking on this off and on. Until now I'm confident about the new approach. It's a large amount of work, so I split the mission into several separate yet self-consistent concrete tasks to ease the review and be more friendly to potential community help.
- [C++][Compute] "Scatter" vector functions #44393: We need this function for scattering gathered batch result into positions where they belong in the original batch.
- [C++][Compute] Move
scatterfunction into compute core #47375: As mentioned above, thetake("gather") andscatterfunctions play an essential role to support selective execution for arbitrary legacy (selection-vector-non-aware) kernels, thus they have to exist in libarrow. Functiontakealready does. We needscatter. - [C++][Compute] Support selective execution for kernels #47376: As mentioned above, we need an incremental way to add selection-vector-aware kernels on demand, meanwhile accommodate legacy (selection-vector-non-aware) kernels to be executed "selection-vector-aware"-ly in a general manner.
- Issue # placeholder: Introduce structures of special form into Arrow compute C++, and probably add
if_elseas the first one because it's almost the most wanted one. The evaluation of the special form will generate selection vector on the fly and ingest it into the evaluation of its arguments, leveraging the selective execution for them. - Add selective kernels to avoid the overhead of "general selective execution" (gather/scatter): e.g.
dividemight be the most common one. - Add other special forms on demand: other conditional (
case_when,coalesce), boolean logical (logicaland,or). - Add Python bindings for special forms.
- Add bindings for other implementations: R, Substrait, etc.
[1] https://en.wikipedia.org/wiki/Evaluation_strategy#Strict_binding_strategies
[2] https://en.wikipedia.org/wiki/Evaluation_strategy#Call_by_name
[3] https://courses.cs.northwestern.edu/325/readings/special-forms.html
[4] https://facebookincubator.github.io/velox/develop/expression-evaluation.html
Component(s)
C++