-
Notifications
You must be signed in to change notification settings - Fork 4.1k
[Python] Add low-level pyarrow bindings for Acero #33976
Description
The Acero query engine is one of the few parts of the Arrow C++ libraries that is not directly exposed in the pyarrow bindings. While you can already run queries using Acero through a substrait plan (pyarrow.substrait.run_query; but that requires first building that plan, and limits to what Acero supports of Substrait), and Acero is already used under the hood in some Dataset methods (pyarrow.dataset.Dataset.join/sort_by, but that doesn't currently allow building up a full query), there are some benefits of having direct, low-level bindings:
- It would be useful for direct users (or downstream packages) of pyarrow that don't want / need to go through substrait (yet another thing to learn / implement)
- It would make it easier for developers to test new Acero functionality from python
- You can use features of Acero that are not (yet) available in substrait
- It could (potentially) give an interface for people developing custom exec nodes
- It would be useful for people that want to do things "right now" (e.g. 12.0.0) instead of waiting for Substrait (tooling) to mature
The idea would be to add a new pyarrow.acero module, with low level bindings for the Acero execution engine. The goal would not be to build a complete user-friendly query interface (like ibis or dplyr), but provide rather direct access to build up a query with the Declaration interface. Minimum abilities would be to construct a plan, inspect plans created through other means (eg from substrait), and execute a plan (to_table, to_reader, write). An initial goal would be to be able to use this interface ourselves in the places that currently have custom cython usage of the exec plan such as Dataset.join (i.e. refactoring _exec_plan.pyx).
Non-exhaustive list of subtasks:
- Basic API to create a query plan / declaration (including the options classes): GH-33976: [Python] Initial bindings for acero Declaration and ExecNodeOptions classes #34102
- Add additional nodes for scan / write datasets: GH-33976: [Python] Add scan ExecNode options #34530
- Replace the internal usage of execplan in Dataset join/sort_by methods: GH-33976: [Python] Refactor the internal usage of ExecPlan to use new Acero Declaration/ExecNodeOptions bindings #34401
- GH-33976: [Python] Expose pyarrow.acero as public submodule and add to reference docs #34760
- Improve usage of compute functions with (field) expressions
- Ability to use registered UDFs in the query (specify the function registry)
- Basic bindings for ExecPlan object? (not create it directly, but be able to inspect it when created from declaration or through substrait)
- (enable using custom exec nodes?)