Skip to content

Conversation

@bkietz
Copy link
Member

@bkietz bkietz commented Jul 23, 2021

  • An extensible registry of exec node factories (std::function<Result<ExecNode*>(ExecPlan* plan, std::vector<ExecNode*> inputs, const ExecNodeOptions& options)>) is provided
  • Hard coded factories like compute::MakeSinkNode, dataset::MakeScanNode are replaced by factories in the registry named "sink", "scan", etc
  • arrow::compute::Declaration is provided to represent an unconstructed set of ExecNodes, which can be validated and emplaced into an ExecPlan as a unit

@apache apache deleted a comment from github-actions bot Jul 23, 2021
@bkietz bkietz changed the title [WIP] Start refactoring away from hard coded ExecNode factories to a registry ARROW-13482: [C++][Compute] Refactoring away from hard coded ExecNode factories to a registry Jul 28, 2021
@github-actions
Copy link

@bkietz bkietz marked this pull request as ready for review July 28, 2021 21:08
@bkietz bkietz requested review from lidavidm and pitrou July 28, 2021 21:09
Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks. Overall this looks cleaner than hand-constructing the graphs before, at least for the simple ones we have in tests.

Copy link
Contributor

@nirandaperera nirandaperera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we could split the exec_plan.cc based on each ExecNode impl? It would be much easier to navigate then

@bkietz
Copy link
Member Author

bkietz commented Jul 30, 2021

Will do, thanks!

@ianmcook
Copy link
Member

ianmcook commented Aug 4, 2021

Is there any plan for documenting these ExecNode factories in a standard way? Does this registry give us opportunities to achieve ARROW-13227 in a modular way? Apologies if this is a silly question; I don't fully understand all the mechanics here.

@bkietz
Copy link
Member Author

bkietz commented Aug 4, 2021

@ianmcook I'll add a note to ARROW-13227 describing the impact of this PR. In short, ARROW-13227 should now document these factories instead of specific subclasses of ExecNode (and will look a lot like compute.rst).

@bkietz bkietz force-pushed the exec-node-factory-registry branch from 211b1da to 87c10c6 Compare August 4, 2021 20:45
@nealrichardson nealrichardson force-pushed the exec-node-factory-registry branch from 87c10c6 to 1b2274e Compare August 4, 2021 21:49
Added factory for SourceExecNode and replacing old MakeProjectNode by the new registry use in plan_test

Formatting..

add Declaration helper class to allow declarative construction of ExecPlans

msvc: export classes and structs

add operator| for dplyr-like pipelines

make labels optional for ExecNodes

remove operator|

add example of registering a custom ExecNode

msvc: try explicitly constructing std::function instead of relying on operator=

remove usage of hard coded factories from plan_test.cc

remove all usages of hard coded factory functions

nit: simplify Declaration::Sequence slightly

msvc: it's callable, really

msvc: export nested structs too

msvc: try the copy constructor, why not

add comments for Declaration

add support for AggregateNodeOptions::targets to ScalarAggregateNode

revert msvc: try the copy constructor, why not

extract exec node implementations to dedicated source files

add output field names to aggregate options
@bkietz bkietz force-pushed the exec-node-factory-registry branch from 9a329a0 to 1d304cd Compare August 6, 2021 20:01
@bkietz bkietz force-pushed the exec-node-factory-registry branch from 1d304cd to 73e5ef7 Compare August 6, 2021 20:06
Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for this (massive) refactor!

I left one small nit.

Comment on lines +41 to +58
namespace internal {

Result<std::vector<const HashAggregateKernel*>> GetKernels(
ExecContext* ctx, const std::vector<internal::Aggregate>& aggregates,
const std::vector<ValueDescr>& in_descrs);

Result<std::vector<std::unique_ptr<KernelState>>> InitKernels(
const std::vector<const HashAggregateKernel*>& kernels, ExecContext* ctx,
const std::vector<internal::Aggregate>& aggregates,
const std::vector<ValueDescr>& in_descrs);

Result<FieldVector> ResolveKernels(
const std::vector<internal::Aggregate>& aggregates,
const std::vector<const HashAggregateKernel*>& kernels,
const std::vector<std::unique_ptr<KernelState>>& states, ExecContext* ctx,
const std::vector<ValueDescr>& descrs);

} // namespace internal
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth moving these to an actual _internal header?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking I'd replace these with usage of Kernel::InitAll or other public & tested utilities in https://issues.apache.org/jira/browse/ARROW-13451. I've added a note to this effect to that JIRA

@bkietz
Copy link
Member Author

bkietz commented Aug 9, 2021

+1, merging

@bkietz bkietz closed this in 2aa94c4 Aug 9, 2021
@bkietz bkietz deleted the exec-node-factory-registry branch August 9, 2021 17:15
michalursa pushed a commit to michalursa/arrow that referenced this pull request Aug 17, 2021
… factories to a registry

- An extensible registry of exec node factories (`std::function<Result<ExecNode*>(ExecPlan* plan, std::vector<ExecNode*> inputs, const ExecNodeOptions& options)>`) is provided
- Hard coded factories like `compute::MakeSinkNode`, `dataset::MakeScanNode` are replaced by factories in the registry named "sink", "scan", etc
- `arrow::compute::Declaration` is provided to represent an unconstructed set of `ExecNode`s, which can be validated and emplaced into an `ExecPlan` as a unit

Closes apache#10793 from bkietz/exec-node-factory-registry

Authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants