ARROW-13313: [C++][Compute] Add scalar aggregate node #10705
Conversation
westonpace left a comment
A few thoughts, but this looks good. It helps my understanding to see the aggregate kernels in action.
```cpp
ExecBatch batch{{}, 1};
batch.values.resize(kernels_.size());

for (size_t i = 0; i < kernels_.size(); ++i) {
```
Maybe someday we could run each kernel's merge on its own thread, but that can be a future PR.
Merging scalar aggregates is pretty trivial, so I'd guess we wouldn't gain much from parallelization. Worth investigating in a follow-up, though.
Ah, in my head "merge" meant something more like a merge sort. If it's just summing a sum/mean/etc. counter across the various states, then I agree it's not necessary.
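For context, a minimal sketch of what "merge" amounts to for a sum/mean-style kernel. `SumCountState` is a hypothetical stand-in, not the actual KernelState used by this PR's kernels:

```cpp
#include <cstdint>

// Hypothetical stand-in for a scalar aggregate's KernelState: merging two
// partial states is a couple of additions, so it is cheap relative to the
// work of consuming the input batches themselves.
struct SumCountState {
  double sum = 0.0;
  int64_t count = 0;

  // Fold another partial state into this one; order is irrelevant for
  // sum/count/mean-style aggregates.
  void MergeFrom(const SumCountState& other) {
    sum += other.sum;
    count += other.count;
  }

  // Produce the final mean once every partial state has been merged.
  double FinalizeMean() const { return count == 0 ? 0.0 : sum / count; }
};
```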
```cpp
  return;
}

lock.lock();
```
This lock could probably be removed; we might want to make a note to measure this with microbenchmarks someday. Only one thread should be finishing anyway, and the "what state blocks have we used" map could probably be a lock-free structure.
InputReceived (for the last batch) might be called concurrently with InputFinished, so the two must synchronize to ensure only one of them does the finishing. It'd certainly be helpful to introduce less clumsy control flow in these classes.
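A hedged sketch of one lock-free shape that arbitration could take; the names and members are illustrative, not the node's actual code (which uses a lock):

```cpp
#include <atomic>

// Illustrative only: InputReceived and InputFinished both call into the
// arbiter, and exactly one caller (whichever completes the picture) wins the
// right to run the finish step.
struct FinishArbiter {
  std::atomic<int> batches_seen{0};
  std::atomic<int> total_batches{-1};  // unknown until InputFinished
  std::atomic<bool> finished{false};

  // Called from InputReceived after a batch has been consumed.
  bool OnBatchConsumed() {
    batches_seen.fetch_add(1);
    return MaybeFinish();
  }

  // Called from InputFinished once the total batch count is known.
  bool OnInputFinished(int total) {
    total_batches.store(total);
    return MaybeFinish();
  }

  // Returns true for exactly one caller: the counts must match and the caller
  // must win the compare-exchange on `finished`.
  bool MaybeFinish() {
    int total = total_batches.load();
    if (total < 0 || batches_seen.load() < total) return false;
    bool expected = false;
    return finished.compare_exchange_strong(expected, true);
  }
};
```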
```cpp
  return Status::OK();
}

void InputReceived(ExecNode* input, int seq, ExecBatch batch) override {
```
Question: to implement something like ARROW-12710 (a string concat aggregate kernel), we'll need to know the order of inputs in the kernels (or will have to feed results into the kernel in order). How do we plan to handle that? By passing down seq and having each kernel reorder inputs itself, or perhaps with an upstream ExecNode that orders its inputs? This also applies to the group-by node.
seq is not an indication of order; it's only a tag in the range [0, seq_stop) (where seq_stop is set by InputFinished), so we couldn't use it to order results.
As specified in ARROW-12710, the KernelState of the string concat agg kernel will need to include ordering criteria so that merge(move(state1), &state0) can be guaranteed equivalent to merge(move(state0), &state1). Furthermore, merge cannot actually concatenate anything, because if we happened to merge(move(state0), &state3) first, we'd have no way to insert state1 or state2 in the middle later. Actual concatenation would have to wait for finalize.
Those ordering criteria could be synthesized from (for example) fragment/batch index information, but the presence of O(N) state in a scalar agg kernel's State is suspect to me and I'm not sure it's a great match for ScalarAggregateKernel.
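To make that concrete, a rough sketch (purely hypothetical, not code from ARROW-12710) of what such an order-aware state would have to look like; it also shows where the O(N) state creeps in. The int64 tag stands in for an assumed ordering criterion such as a fragment/batch index supplied from upstream:

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Hypothetical order-aware state for a string-concat aggregate: merge can only
// collect (ordering tag, partial string) pairs, because a later merge may
// deliver pieces that belong *between* pieces already held. Only finalize,
// which sees every piece, can sort by tag and concatenate.
struct OrderedConcatState {
  std::vector<std::pair<int64_t, std::string>> pieces;  // O(N) state

  void MergeFrom(OrderedConcatState&& other) {
    std::move(other.pieces.begin(), other.pieces.end(),
              std::back_inserter(pieces));
  }

  std::string Finalize() {
    std::sort(pieces.begin(), pieces.end());
    std::string out;
    for (const auto& piece : pieces) out += piece.second;
    return out;
  }
};
```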
Ah thanks, sorry for the misunderstanding (I need to stop thinking only about datasets).
I suppose it only makes sense to talk about 'order' when directly downstream from a scan or explicit sort, then. And any aggregates that have O(N) state might properly belong as their own ExecNode.
```cpp
// finally, pipe the project node into a sink node
// NB: if we don't need ordering, we could use compute::MakeSinkNode instead
ASSERT_OK_AND_ASSIGN(auto sink_gen, dataset::MakeOrderedSinkNode(project, "sink"));
```
Thanks for the example! I do like this setup a lot more since it is clearer what all the steps are in reading a dataset, and everything is neatly separated. It is not that the current scanner doesn't separate out the various stages of its pipeline, but the scanner's pipeline is rather tied together, while this is clearly partitioned.
This is a pretty trivial node, but it's needed for completeness and will give bindings a pipeline breaker to experiment with until #10660 merges.