
Conversation


@fsaintjacques fsaintjacques commented Jan 15, 2019

This is a draft of the aggregate kernel interface. The goal of publishing this PR is to gather feedback on the design and architecture.

  • AggregateUnaryKernel/AggregateState decomposition decouples the kernel implementation from parallel execution. The implementor of an AggregateState does not need to be concerned with the details of parallel execution (other than the thread-safety of the Consume method). It also allows users to implement custom aggregates.
  • Monoid, a class representing a mathematical monoid.
  • MonoidAggregateState, a specific implementation of AggregateState that is generic enough to support various monoids. Note that it is limited to monoids known at compile time, usually monoids on primitives.

Currently only the "Sum" kernel is implemented. Further kernels will be implemented in follow-up patches.

The current draft does not take into account filtering and groups. I do have an idea of how this would be done, e.g. add AggregateState::Consume(..., const Mask &filter). I delayed incorporating this interface until we implement the filtering kernel. This is going to be done in a different ticket.

Member

Couldn't you just static_cast<Type::type>(this->value.which())?

Contributor Author

We'd have to make sure the mapping is exact, which is almost equivalent to making this list explicit.

Member

These should be undefed below

Contributor Author

No, since they're "exported" macros in other files.

Contributor

@emkornfield emkornfield left a comment

Not an expert in this space so take my comments with a grain of salt.

Contributor

Generally, you would want to pass this through as a unique_ptr to make the ownership transfer explicit

Member

Agreed

Contributor

this seems like a slightly different pattern than was taken for UnaryArrayKernel (i.e. this pattern wasn't in the kernel, but instead in util-internal.h). I haven't thought through whether a similar pattern should be established, have you?

Contributor Author

I didn't see util-internal.h, I'll try to merge both. When I used this piece, I would have preferred not to be concerned with this, and hoped that Datum provided a ChunkedArray for both cases (wrapping in a single-element ChunkedArray when the datum is an Array).

I'll add this feature.

Contributor

I think this is only true for Monoids?

Contributor Author

Do you have an example of an aggregate function which is not associative?

Contributor

@emkornfield emkornfield Jan 18, 2019

Short answer (might be contrived): "exponentially smoothed moving average" and "hash elements"

Longer answer (probably obvious and too pedantic): I was assuming a naive parallelization where Consume could be called in any order, which would require commutativity and associativity. With non-commutative associative operations (e.g. string concatenation, FIRST, LAST, etc.) there are more book-keeping details to ensure parallelization still returns the correct results. At the moment I'm having a hard time coming up with an example that benefits from parallelization of Consume that is not commutative.

From a practical standpoint for this review:

  • It might be good to add some additional pseudo-code for the parallelism AggregateState is meant to handle,
  • Possibly having some mechanism to signal commutative and non-commutative state.

Contributor

To answer my own question about associative but not commutative multithreading, the parsing of CSV files is a good example where multithreading makes sense.

Contributor

Are compilers generally good enough to optimize out any overhead from calling the constructor (I would assume so), or would it make sense to have a slightly less clean Monoid abstraction that takes the raw ValueType?

Contributor Author

@fsaintjacques fsaintjacques Jan 16, 2019

The compiler is clever enough (at least with -O3). But we can also add a second operator+= that takes the scalar value. When I compile locally with -march=native, this is the hot loop as reported by perf.

10.96 │       vpaddq (%rax,%rdi,8),%zmm0,%zmm0     
 6.73 │       vpaddq 0x40(%rax,%rdi,8),%zmm1,%zmm1 
 5.26 │       vpaddq 0x80(%rax,%rdi,8),%zmm2,%zmm2 
 5.40 │       vpaddq 0xc0(%rax,%rdi,8),%zmm3,%zmm3 
 9.53 │       vpaddq 0x100(%rax,%rdi,8),%zmm0,%zmm0
 5.78 │       vpaddq 0x140(%rax,%rdi,8),%zmm1,%zmm1
 5.06 │       vpaddq 0x180(%rax,%rdi,8),%zmm2,%zmm2
 5.06 │       vpaddq 0x1c0(%rax,%rdi,8),%zmm3,%zmm3
 8.38 │       vpaddq 0x200(%rax,%rdi,8),%zmm0,%zmm0
 5.53 │       vpaddq 0x240(%rax,%rdi,8),%zmm1,%zmm1
 4.97 │       vpaddq 0x280(%rax,%rdi,8),%zmm2,%zmm2
 4.84 │       vpaddq 0x2c0(%rax,%rdi,8),%zmm3,%zmm3
 7.46 │       vpaddq 0x300(%rax,%rdi,8),%zmm0,%zmm0
 5.39 │       vpaddq 0x340(%rax,%rdi,8),%zmm1,%zmm1
 5.07 │       vpaddq 0x380(%rax,%rdi,8),%zmm2,%zmm2
 4.34 │       vpaddq 0x3c0(%rax,%rdi,8),%zmm3,%zmm3
 0.24 │       sub    $0xffffffffffffff80,%rdi      

Contributor Author

Also note that some Monoids are composite and aren't made of a single scalar value, e.g. the Mean monoid is made of std::tuple<size_t, T>.

Contributor

👍 cool, it's nice to see the compiler can auto-vectorize this.

Contributor

I'm not sure if the discussion happened already, but it would be good to understand the general plan around the threading model, how this relates to the physical execution layer (i.e. something like the Volcano model), and what exactly should live at which level (i.e. what belongs in the kernel/state).

I'm sure you considered this, but a different design might create a kernel per thread, then merge each scalar at the end if the work can in fact be run in parallel. Probably not much difference performance-wise, but it still might eliminate an unnecessary memory barrier/system call.

Contributor

Created ARROW-4333 to track this.

Member

I also am not sure it's a good idea to lock inside Consume

Contributor

Do monoids always have to be a scalar? Would it make sense to have monoid_ populate the datum instead of assuming this?

Contributor Author

In a subsequent change the MonoidState will also be parametrized with a finalizer, which is a functor that materializes the state to a Datum. Mean and HyperLogLog are examples that need this.

Contributor

nit: CheckSum is a slightly confusing name, since checksum is a potential operation. Something like VerifySum might read a little better.

Member

wesm commented Jan 16, 2019

I will take a closer look, but one comment: thread control / multithreading should be handled (IMHO) at a higher level than the aggregation kernels.

Contributor

might be nice to have a benchmark that includes some nulls

Member

See some benchmarking I did about this http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/

Contributor Author

@emkornfield I will add both; my first benchmark was with nulls, but I removed them to benchmark with perf.

Member

Can you post a performance comparison with the hand-optimized loop?

Contributor Author

Will do.

Contributor

one interesting case is all nulls.

Contributor Author

@fsaintjacques fsaintjacques Jan 17, 2019

Yes, and it's already an issue with the current implementation: the min/max monoid would return std::numeric_limits<T>::max/min respectively, when it should return NULL.

Contributor

Would it make sense to include string and unicode types in here?

Contributor

Filed ARROW-4331 to track adding additional types.

Member

wesm commented Jan 23, 2019

Will review soon

@fsaintjacques fsaintjacques force-pushed the ARROW-4124-aggregation-kernel branch from 680d60c to 7fb9282 on January 23, 2019 16:46
Contributor Author

As @emkornfield proposed, let's move the threading/execution-architecture discussion to https://issues.apache.org/jira/browse/ARROW-4333.

Member

@wesm wesm left a comment

Unfortunately, this implementation has branching in the inner loop, so the approach here might have to be rethought. The performance difference from removing branching and partially unrolling the inner loop is huge. See http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/

Member

Hm. I'm not sure about this. See https://issues.apache.org/jira/browse/ARROW-47. It's OK to leave this as is but marked experimental

Member

The changes in this file feel like a kludge to me. Let's not do anything else involving scalar outputs until resolving ARROW-47. I'll put up a patch for that tomorrow with any luck

Member

Agreed

Member

I'm skeptical about whether this is going to generate good code when we don't compile with -march=native (because we can't do this in many of the binaries we ship...). Compare with some analysis I did of optimizing "sum" in http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/ -- if we aren't coming within 5% of hand-optimized code with these abstractions, we need to go back to the drawing board



Member

wesm commented Jan 27, 2019

Here's the hand-tuned implementation I'm referring to

https://github.com/wesm/bitmaps-vs-sentinels/blob/master/benchmark.cc#L227

Member

wesm commented Jan 27, 2019

Another question that needs to be resolved is how to return a NULL value. What we do in pandas is have a "min_values" option, so:

  • sum of all nulls is 0 by default, but you can set min_values=1 if you want it to be NULL instead
  • mean of all nulls is NULL (min_values=1 by default)

Member

wesm commented Feb 9, 2019

@fsaintjacques the PR description is out of date, can you update? I'm going to review and leave comments so we can merge this and move on to the next steps

@wesm wesm force-pushed the ARROW-4124-aggregation-kernel branch from 78f9dbf to 9bd1f25 on February 9, 2019 21:30
Change-Id: Iacd7e6bdd6ce03cff3003ea1f8e703ba42c3ab7a
Member

@wesm wesm left a comment

Some minor comments for follow up patches. I'm going to tweak a few things and then merge this once the build is passing



# Aggregates
ADD_ARROW_TEST(sum-test PREFIX "arrow-compute")
ADD_ARROW_BENCHMARK(sum-benchmark PREFIX "arrow-compute")
Member

Let's plan to test and benchmark all the aggregates in one executable each (one for unit test, one for benchmark). So these names will have to change. Let's change them now so we don't have to move the file later

Member

done

Contributor Author

👍, such that we amortize the slow linking issue on Windows?

Member

among other things, yeah. I don't think we should have one test executable for each kernel, but rather organize tests according to logical relationships (e.g. "mean" is close to "sum")

static std::shared_ptr<ManagedAggregateState> Make(
    std::shared_ptr<AggregateFunction>& desc, MemoryPool* pool) {
  std::shared_ptr<Buffer> buf;
  if (!AllocateBuffer(pool, desc->Size(), &buf).ok()) return nullptr;
Member

This is a bad code smell

class AggregateFunction {
 public:
  /// \brief Consume an array into a state.
  virtual Status Consume(const Array& input, void* state) const = 0;
Member

Any reason not to accept ArrayData here and avoid unnecessary boxing / unboxing?

Contributor Author

The intent was to have strongly typed parameters and leave the error handling to the caller.

auto state = ManagedAggregateState::Make(aggregate_function_, ctx->memory_pool());
if (!state) return Status::OutOfMemory("AggregateState allocation failed");

auto array = input.make_array();
Member

See below comments re: this boxing step which could be avoided

ValidateSum<TypeParam>(&this->ctx_, "[]", Datum());

ValidateSum<TypeParam>(&this->ctx_, "[0, 1, 2, 3, 4, 5]",
                       Datum(Scalar(static_cast<SumType>(5 * 6 / 2))));
Member

Should have a test for the "all nulls" case. The output can be preliminary (i.e. returning 0)

if (input.null_count() > 0)
  *state = ConsumeSparse(array);
else
  *state = ConsumeDense(array);
Member

braces

Member

done

StateType local;

// TODO(fsaintjacques): This fails on slice not byte-aligned.
DCHECK_EQ(array.offset() % 8, 0);
Member

Create follow up JIRA?

const auto values = array.raw_values();
const auto bitmap = array.null_bitmap_data() + BitUtil::RoundDown(array.offset(), 8);
const auto length = array.length();
const auto length_rounded = BitUtil::RoundDown(length, 8);
Member

Should write down explicit types instead of const auto

// Returns 'value' rounded down to the nearest multiple of 'factor'
constexpr int64_t RoundDown(int64_t value, int64_t factor) {
  return (value / factor) * factor;
}
Member

Add unit tests for this

wesm added 3 commits February 9, 2019 18:16
Change-Id: I9a1f46f10c30d6f740c7896a2c72195013a714e7
Change-Id: I7f5d047191d58369f6e73650f3d705e967a2ba7b
Change-Id: Ib638796e4bb7c456625b8c29ba803f071051fd66
@wesm wesm changed the title ARROW-4124: [C++] Draft Aggregate and Sum kernels [WIP] ARROW-4124: [C++] Draft Aggregate and Sum kernels Feb 10, 2019
Member

wesm commented Feb 10, 2019

@fsaintjacques I tweaked the PR description. I'll merge this once the build passes, in the interest of project velocity. There are a lot of follow-up issues to create, so I'll leave that to you

@codecov-io

Codecov Report

Merging #3407 into master will increase coverage by 0.87%.
The diff coverage is 89.44%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3407      +/-   ##
==========================================
+ Coverage   87.75%   88.62%   +0.87%     
==========================================
  Files         667      542     -125     
  Lines       82541    73793    -8748     
  Branches     1069        0    -1069     
==========================================
- Hits        72434    65400    -7034     
+ Misses       9996     8393    -1603     
+ Partials      111        0     -111
Impacted Files Coverage Δ
cpp/src/arrow/compute/kernels/aggregate-test.cc 100% <100%> (ø)
cpp/src/arrow/compute/kernels/hash-test.cc 100% <100%> (ø) ⬆️
cpp/src/arrow/test-random.h 100% <100%> (ø) ⬆️
cpp/src/arrow/compute/kernels/aggregate.cc 100% <100%> (ø)
cpp/src/arrow/util/bit-util.h 100% <100%> (ø) ⬆️
cpp/src/arrow/compute/kernel.h 82.08% <55%> (-9.03%) ⬇️
cpp/src/arrow/compute/kernels/aggregate.h 87.5% <87.5%> (ø)
cpp/src/arrow/compute/kernels/sum.cc 87.8% <87.8%> (ø)
go/arrow/math/uint64_amd64.go
... and 135 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 35c3e5e...568ba09. Read the comment docs.

@wesm wesm closed this in 9a10bae Feb 10, 2019
Member

wesm commented Feb 10, 2019

The benchmark results on my machine for posterity:

---------------------------------------------------------------------------------------------------------------
Benchmark                                                        Time           CPU Iterations UserCounters...
---------------------------------------------------------------------------------------------------------------
BenchSum<SumNoNullsUnrolled<int64_t>>/32768/0                    1 us          1 us     833069 null_percent=0 size=32.768k   37.2508GB/s
BenchSum<SumNoNullsUnrolled<int64_t>>/32768/1                    1 us          1 us     838507 null_percent=1 size=32.768k   37.5707GB/s
BenchSum<SumNoNullsUnrolled<int64_t>>/32768/10                   1 us          1 us     833487 null_percent=10 size=32.768k   37.6234GB/s
BenchSum<SumNoNullsUnrolled<int64_t>>/32768/50                   1 us          1 us     691232 null_percent=50 size=32.768k   36.0022GB/s
BenchSum<SumNoNullsUnrolled<int64_t>>/262144/0                  12 us         12 us      57468 null_percent=0 size=262.144k   20.2227GB/s
BenchSum<SumNoNullsUnrolled<int64_t>>/262144/1                  12 us         12 us      57788 null_percent=1 size=262.144k     19.83GB/s
BenchSum<SumNoNullsUnrolled<int64_t>>/262144/10                 12 us         12 us      58093 null_percent=10 size=262.144k   20.1576GB/s
BenchSum<SumNoNullsUnrolled<int64_t>>/262144/50                 12 us         12 us      56866 null_percent=50 size=262.144k   20.0273GB/s
BenchSum<SumNoNullsUnrolled<int64_t>>/8388608/0                509 us        509 us       1421 null_percent=0 size=8.38861M   15.3606GB/s
BenchSum<SumNoNullsUnrolled<int64_t>>/8388608/1                504 us        504 us       1360 null_percent=1 size=8.38861M   15.5096GB/s
BenchSum<SumNoNullsUnrolled<int64_t>>/8388608/10               489 us        489 us       1425 null_percent=10 size=8.38861M   15.9891GB/s
BenchSum<SumNoNullsUnrolled<int64_t>>/8388608/50               517 us        517 us       1000 null_percent=50 size=8.38861M   15.1099GB/s
BenchSum<SumNoNullsUnrolled<int64_t>>/33554432/0              2336 us       2336 us        275 null_percent=0 size=33.5544M   13.3768GB/s
BenchSum<SumNoNullsUnrolled<int64_t>>/33554432/1              2344 us       2343 us        300 null_percent=1 size=33.5544M   13.3348GB/s
BenchSum<SumNoNullsUnrolled<int64_t>>/33554432/10             2337 us       2337 us        297 null_percent=10 size=33.5544M   13.3746GB/s
BenchSum<SumNoNullsUnrolled<int64_t>>/33554432/50             2404 us       2404 us        299 null_percent=50 size=33.5544M   13.0007GB/s
BenchSum<SumSentinelUnrolled<int64_t>>/32768/0                   3 us          3 us     218287 null_percent=0 size=32.768k   9.51592GB/s
BenchSum<SumSentinelUnrolled<int64_t>>/32768/1                   3 us          3 us     216732 null_percent=1 size=32.768k    9.4701GB/s
BenchSum<SumSentinelUnrolled<int64_t>>/32768/10                  3 us          3 us     218117 null_percent=10 size=32.768k   9.52378GB/s
BenchSum<SumSentinelUnrolled<int64_t>>/32768/50                  3 us          3 us     216833 null_percent=50 size=32.768k   9.45774GB/s
BenchSum<SumSentinelUnrolled<int64_t>>/262144/0                 26 us         26 us      27222 null_percent=0 size=262.144k   9.53919GB/s
BenchSum<SumSentinelUnrolled<int64_t>>/262144/1                 26 us         26 us      27233 null_percent=1 size=262.144k   9.52018GB/s
BenchSum<SumSentinelUnrolled<int64_t>>/262144/10                26 us         26 us      27106 null_percent=10 size=262.144k   9.52412GB/s
BenchSum<SumSentinelUnrolled<int64_t>>/262144/50                26 us         26 us      27039 null_percent=50 size=262.144k   9.52342GB/s
BenchSum<SumSentinelUnrolled<int64_t>>/8388608/0               866 us        866 us        802 null_percent=0 size=8.38861M   9.02228GB/s
BenchSum<SumSentinelUnrolled<int64_t>>/8388608/1               865 us        865 us        792 null_percent=1 size=8.38861M   9.03213GB/s
BenchSum<SumSentinelUnrolled<int64_t>>/8388608/10              865 us        865 us        808 null_percent=10 size=8.38861M   9.02746GB/s
BenchSum<SumSentinelUnrolled<int64_t>>/8388608/50              866 us        866 us        807 null_percent=50 size=8.38861M   9.02335GB/s
BenchSum<SumSentinelUnrolled<int64_t>>/33554432/0             3541 us       3541 us        198 null_percent=0 size=33.5544M   8.82504GB/s
BenchSum<SumSentinelUnrolled<int64_t>>/33554432/1             3541 us       3540 us        197 null_percent=1 size=33.5544M   8.82724GB/s
BenchSum<SumSentinelUnrolled<int64_t>>/33554432/10            3545 us       3545 us        198 null_percent=10 size=33.5544M   8.81579GB/s
BenchSum<SumSentinelUnrolled<int64_t>>/33554432/50            3543 us       3543 us        198 null_percent=50 size=33.5544M   8.81922GB/s
BenchSum<SumBitmapVectorizeUnroll<int64_t>>/32768/0              1 us          1 us     530767 null_percent=0 size=32.768k   23.1373GB/s
BenchSum<SumBitmapVectorizeUnroll<int64_t>>/32768/1              1 us          1 us     480515 null_percent=1 size=32.768k   20.9358GB/s
BenchSum<SumBitmapVectorizeUnroll<int64_t>>/32768/10             2 us          2 us     328397 null_percent=10 size=32.768k   14.2905GB/s
BenchSum<SumBitmapVectorizeUnroll<int64_t>>/32768/50             3 us          3 us     255490 null_percent=50 size=32.768k   11.1082GB/s
BenchSum<SumBitmapVectorizeUnroll<int64_t>>/262144/0            14 us         14 us      50120 null_percent=0 size=262.144k   17.4662GB/s
BenchSum<SumBitmapVectorizeUnroll<int64_t>>/262144/1            15 us         15 us      47530 null_percent=1 size=262.144k   16.6678GB/s
BenchSum<SumBitmapVectorizeUnroll<int64_t>>/262144/10           21 us         21 us      33159 null_percent=10 size=262.144k    11.662GB/s
BenchSum<SumBitmapVectorizeUnroll<int64_t>>/262144/50           23 us         23 us      30148 null_percent=50 size=262.144k   10.5367GB/s
BenchSum<SumBitmapVectorizeUnroll<int64_t>>/8388608/0          553 us        553 us       1149 null_percent=0 size=8.38861M   14.1303GB/s
BenchSum<SumBitmapVectorizeUnroll<int64_t>>/8388608/1          638 us        638 us       1101 null_percent=1 size=8.38861M   12.2365GB/s
BenchSum<SumBitmapVectorizeUnroll<int64_t>>/8388608/10        1148 us       1148 us        612 null_percent=10 size=8.38861M   6.80722GB/s
BenchSum<SumBitmapVectorizeUnroll<int64_t>>/8388608/50         780 us        780 us        893 null_percent=50 size=8.38861M   10.0146GB/s
BenchSum<SumBitmapVectorizeUnroll<int64_t>>/33554432/0        2513 us       2513 us        281 null_percent=0 size=33.5544M   12.4369GB/s
BenchSum<SumBitmapVectorizeUnroll<int64_t>>/33554432/1        2854 us       2854 us        246 null_percent=1 size=33.5544M   10.9505GB/s
BenchSum<SumBitmapVectorizeUnroll<int64_t>>/33554432/10       5169 us       5170 us        100 null_percent=10 size=33.5544M   6.04495GB/s
BenchSum<SumBitmapVectorizeUnroll<int64_t>>/33554432/50       3231 us       3231 us        207 null_percent=50 size=33.5544M   9.67302GB/s

The L1-cache-sized benchmarks seem like they can be a bit noisy

const auto length = array.length();
for (int64_t i = 0; i < length; i++) {
  // NaN is not equal to itself
  local.total += values[i] * Traits<T>::NotNull(values[i]);
Contributor Author

The previous version was faster, clang-6 and onward auto-vectorize this perfectly (even with just O3): https://godbolt.org/z/v5nFNM . The result would be even better with AVX512 enabled (see https://godbolt.org/z/LFcQJv and https://llvm.org/devmtg/2015-04/slides/MaskedIntrinsics.pdf).

wesm pushed a commit that referenced this pull request Mar 14, 2019
…ue frequencies

Picked up from: #1970

I mostly reused the unit tests as is and modified the rest based on feedback on that PR, adapting to new code. Some other changes I made:
1.  Remove the HashKernel and getter from the public API.  I think we can add these back once we have a better idea on how we are doing stateful kernels (i.e. #3407)
2. Add operator[] to NumericBuilder to modify previously added values (this might not be the right place, and I had a little bit of trouble figuring out how to integrate this into the existing TypedTest, so the testing is a little bit weak). This seemed better than using a vector to collect values.

If this approach looks OK for now, I'm also going to open up a JIRA to try to refactor the unit tests for Hash kernels (and maybe the headers), since I think there might be a clearer, more granular way of laying these out.

other things to discuss:
1.  Handling null values in Count/Unique (it looks like they are dropped today; should this be configurable/turned on?).
2.  Hashing edge cases for floating point numbers (just opened a JIRA on this).

Author: Micah Kornfield <emkornfield@gmail.com>
Author: Alessandro Andrioni <alessandroandrioni@gmail.com>

Closes #3579 from emkornfield/count_kernel and squashes the following commits:

9c55f7b <Micah Kornfield> make 64 bit
dd0d8a1 <Micah Kornfield> fix link and warning
72095eb <Micah Kornfield> Templatize whether to use return status
54afb2b <Micah Kornfield> change from std::copy to memcopy
2973fcc <Micah Kornfield> Address code review comments
e7e624b <Micah Kornfield> Add constants, per code review
4770d99 <Micah Kornfield> fix warning
c6f6ad7 <Micah Kornfield> address feedback
d99e52f <Micah Kornfield> add guard to CopyValue in cases where vector is empty
8c26b01 <Micah Kornfield> fix format
b7d5492 <Micah Kornfield> add null test
f964bd6 <Micah Kornfield> Rebase
e8e58a5 <Micah Kornfield> Address output type code review feedback
defb4f1 <Micah Kornfield> remove export from .cc
0152f2f <Micah Kornfield> plumb through status on hash visitors
afeb1ad <Micah Kornfield> add real jira
96858bd <Micah Kornfield> Use macro inversion to reduce boiler plate
0dd0077 <Micah Kornfield> minimal test
57349f7 <Micah Kornfield> unit tests passing
34834f7 <Alessandro Andrioni> First try at implementing a CountValues kernel
QuietCraftsmanship pushed a commit to QuietCraftsmanship/arrow that referenced this pull request Jul 7, 2025