Skip to content

[C++][Compute] Support special form in expression system #47374

@zanmato1984

Description

@zanmato1984

Describe the enhancement requested

This is the umbrella issue for supporting special form expression in Arrow compute.

Rationale

It all starts from #41094. The comment #41094 (comment) is the rationale.

To summarize a bit, consider expression if (a != 0) then (b / a) else 0. In current Arrow expression evaluation, it is a nested function call to if_else(ne(a, 0), div(b, a), 0), which eagerly evaluates all its parameters including the div(b, a) even if a can be 0, or more formally, this is a "strict binding" strategy [1]. This is unintuitive and problematic given the meaning of if_else. What we expect here is a non-strict binding strategy, "call-by-name" [2]. Most programming languages have exceptional support for a handful "special" expressions , e.g. Lisp calls those "special forms" [3]. And as a closer peer, Velox uses the same term [4]. (Thanks @felipecrv for all these references!)

It is especially tricky to support such special forms in a vectorized engine like Arrow. There are two main challenges:

  • For conditional evaluation like if cond then if_branch else else_branch, all the arguments are evaluated on a batch basis. However which branch to evaluate (if_branch VS. else_branch) is decided by cond on a row basis. We need a "selection vector" during evaluation to mask which rows in the batch to evaluate and which not - this is about not only the efficiency, but also the "visible side effect" (consider divide by zero thrown in a divisor != 0 branch).
  • In addition to the selection vector, the kernels must respect it. Currently none does. And it's almost impossible for us to make all existing kernels to respect the selection vector at once (and we probably never will). Thus we need an incremental way to add selection-vector-aware kernels on demand, meanwhile accommodate legacy (selection-vector-non-aware) kernels to be executed "selection-vector-aware"-ly in a general manner - the idea is to first "gather" selected rows from the batch into a new batch, evaluate the expression on the new batch, then "scatter" the result rows into the positions where they belong in the original batch.

Approach

I did try to solve this in #42106 last year, however neither the reviewer nor myself were quite satisfied with the solution back then. Since then I've been reworking on this off and on. Until now I'm confident about the new approach. It's a large amount of work, so I split the mission into several separate yet self-consistent concrete tasks to ease the review and be more friendly to potential community help.

  • [C++][Compute] "Scatter" vector functions #44393: We need this function for scattering gathered batch result into positions where they belong in the original batch.
  • [C++][Compute] Move scatter function into compute core #47375: As mentioned above, the take ("gather") and scatter functions play an essential role to support selective execution for arbitrary legacy (selection-vector-non-aware) kernels, thus they have to exist in libarrow. Function take already does. We need scatter.
  • [C++][Compute] Support selective execution for kernels #47376: As mentioned above, we need an incremental way to add selection-vector-aware kernels on demand, meanwhile accommodate legacy (selection-vector-non-aware) kernels to be executed "selection-vector-aware"-ly in a general manner.
  • Issue # placeholder: Introduce structures of special form into Arrow compute C++, and probably add if_else as the first one because it's almost the most wanted one. The evaluation of the special form will generate selection vector on the fly and ingest it into the evaluation of its arguments, leveraging the selective execution for them.
  • Add selective kernels to avoid the overhead of "general selective execution" (gather/scatter): e.g. divide might be the most common one.
  • Add other special forms on demand: other conditional (case_when, coalesce), boolean logical (logical and, or).
  • Add Python bindings for special forms.
  • Add bindings for other implementations: R, Substrait, etc.

[1] https://en.wikipedia.org/wiki/Evaluation_strategy#Strict_binding_strategies
[2] https://en.wikipedia.org/wiki/Evaluation_strategy#Call_by_name
[3] https://courses.cs.northwestern.edu/325/readings/special-forms.html
[4] https://facebookincubator.github.io/velox/develop/expression-evaluation.html

Component(s)

C++

Sub-issues

Metadata

Metadata

Assignees

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions