
Conversation


@fsaintjacques fsaintjacques commented Jan 15, 2019

This is a draft of the aggregate kernel interface. The goal of publishing this PR is to gather feedback on the design and architecture.

  • AggregateUnaryKernel/AggregateState decomposition decouples the kernel implementation from parallel execution. The implementor of an AggregateState does not need to be concerned with the details of parallel execution (other than the thread-safety of the Consume method). It also allows users to implement custom aggregates.
  • Monoid, a class representing a mathematical monoid.
  • MonoidAggregateState, a specific implementation of AggregateState that is generic enough to support various monoids. Note that it is limited to monoids known at compile time, usually monoids on primitives.

Currently only the "Sum" kernel is implemented. Further kernels will be implemented in follow-up patches.

The current draft does not take into account filtering and groups. I do have an idea of how this would be done, e.g. add AggregateState::Consume(..., const Mask &filter). I delayed incorporating this interface until we implement the filtering kernel. This is going to be done in a different ticket.

Member

Couldn't you just static_cast<Type::type>(this->value.which())?

Contributor Author

We'd have to make sure the mapping is exact, which is almost equivalent to making this list explicit.

Member

These should be undefed below

Contributor Author

No, since they're "exported" macros in other files.

Contributor

@emkornfield emkornfield left a comment

Not an expert in this space so take my comments with a grain of salt.

Contributor

Generally, you would want to pass this through as a unique_ptr to make the ownership transfer explicit

Member

Agreed

Contributor

this seems like a slightly different pattern than was taken for UnaryArrayKernel (i.e. this pattern wasn't in the kernel, but instead in util-internal.h). I haven't thought through whether a similar pattern should be established, have you?

Contributor Author

I didn't see util-internal.h, I'll try to merge both. When I used this piece, I would have preferred not to be concerned with this, and hoped that Datum provided a ChunkedArray for both cases (wrapping in a single-element ChunkedArray when the datum is an Array).

I'll add this feature.

Contributor

I think this is only true for Monoids?

Contributor Author

Do you have an example of an aggregate function which is not associative?

Contributor

@emkornfield emkornfield Jan 18, 2019

Short answer (might be contrived): "exponentially smoothed moving average" and "hash elements"

Longer answer (probably obvious and too pedantic): I was assuming a naive parallelization where Consume could be called in any order, which would require commutativity and associativity. With non-commutative associative operations (e.g. string concatenation, FIRST, LAST, etc.) there are more book-keeping details to ensure parallelization still returns the correct results. At the moment I'm having a hard time coming up with an example that benefits from parallelization of Consume that is not commutative.

From a practical standpoint for this review:

  • It might be good to add some additional pseudo-code for the parallelism AggregateState is meant to handle,
  • Possibly having some mechanism to signal commutative and non-commutative state.

Contributor

To answer my own question about associative but not commutative multithreading, the parsing of CSV files is a good example where multithreading makes sense.

Contributor

Are compilers generally good enough to optimize out any overhead from calling the constructor (I would assume so), or would it make sense to have a slightly less clean Monoid abstraction that takes the raw ValueType?

Contributor Author

@fsaintjacques fsaintjacques Jan 16, 2019

The compiler is clever enough (at least with -O3). But we can also add a second operator+= that takes the scalar value. When I compile locally with -march=native, this is the hot loop as reported by perf.

10.96 │       vpaddq (%rax,%rdi,8),%zmm0,%zmm0     
 6.73 │       vpaddq 0x40(%rax,%rdi,8),%zmm1,%zmm1 
 5.26 │       vpaddq 0x80(%rax,%rdi,8),%zmm2,%zmm2 
 5.40 │       vpaddq 0xc0(%rax,%rdi,8),%zmm3,%zmm3 
 9.53 │       vpaddq 0x100(%rax,%rdi,8),%zmm0,%zmm0
 5.78 │       vpaddq 0x140(%rax,%rdi,8),%zmm1,%zmm1
 5.06 │       vpaddq 0x180(%rax,%rdi,8),%zmm2,%zmm2
 5.06 │       vpaddq 0x1c0(%rax,%rdi,8),%zmm3,%zmm3
 8.38 │       vpaddq 0x200(%rax,%rdi,8),%zmm0,%zmm0
 5.53 │       vpaddq 0x240(%rax,%rdi,8),%zmm1,%zmm1
 4.97 │       vpaddq 0x280(%rax,%rdi,8),%zmm2,%zmm2
 4.84 │       vpaddq 0x2c0(%rax,%rdi,8),%zmm3,%zmm3
 7.46 │       vpaddq 0x300(%rax,%rdi,8),%zmm0,%zmm0
 5.39 │       vpaddq 0x340(%rax,%rdi,8),%zmm1,%zmm1
 5.07 │       vpaddq 0x380(%rax,%rdi,8),%zmm2,%zmm2
 4.34 │       vpaddq 0x3c0(%rax,%rdi,8),%zmm3,%zmm3
 0.24 │       sub    $0xffffffffffffff80,%rdi      

Contributor Author

Also note that some Monoids are composite and aren't made of a single scalar value, e.g. the Mean monoid is made of std::tuple<size_t, T>.

Contributor

👍 cool, it's nice to see the compiler can auto-vectorize this.

Contributor

I'm not sure if the discussion happened already, but it would be good to understand the general plan around the threading model, how this relates to the physical execution layer (i.e. something like the Volcano model), and what exactly should live at which level (i.e. what belongs in the kernel/state).

I'm sure you considered this, but a different design might create a kernel per thread, then merge each scalar at the end if the work can in fact be run in parallel. Probably not much difference performance-wise, but it still might eliminate an unnecessary memory barrier/system call.

Contributor

Created ARROW-4333 to track this.

Member

I also am not sure it's a good idea to lock inside Consume

Contributor

Do monoids always have to be a scalar? Would it make sense to have monoid_ populate the datum instead of assuming this?

Contributor Author

In a subsequent change the MonoidState will also be parametrized with a finalizer, which is a functor that materializes the state to a Datum. Mean and HyperLogLog are examples that need this.

Contributor

nit: CheckSum is a slightly confusing name, since checksum is a potential operation. Something like VerifySum might read a little better.

Member

wesm commented Jan 16, 2019

I will take a closer look, but one comment: thread control / multithreading should be handled (IMHO) at a higher level than the aggregation kernels.

Contributor

might be nice to have a benchmark that includes some nulls

Member

See some benchmarking I did about this http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/

Contributor Author

@emkornfield I will add both; my first benchmark was with nulls, but I removed them to benchmark with perf.

Member

Can you post a performance comparison with the hand-optimized loop?

Contributor Author

Will do.

Contributor

one interesting case is all nulls.

Contributor Author

@fsaintjacques fsaintjacques Jan 17, 2019

Yes, and it's already an issue with the current implementation: the min/max monoid would return std::numeric_limits<T>::max/min respectively, when it should return NULL.

Contributor

Would it make sense to include string and unicode types in here?

Contributor

Filed ARROW-4331 to track adding additional types.

Member

wesm commented Jan 23, 2019

Will review soon

@fsaintjacques fsaintjacques force-pushed the ARROW-4124-aggregation-kernel branch from 680d60c to 7fb9282 on January 23, 2019 16:46
Contributor Author

As @emkornfield proposed, let's move the threading/execution-architecture discussion to https://issues.apache.org/jira/browse/ARROW-4333.

Member

@wesm wesm left a comment

Unfortunately, this implementation has branching in the inner loop, so the approach here might have to be rethought. The performance difference from removing branching and partially unrolling the inner loop is huge. See http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/

Member

Hm. I'm not sure about this. See https://issues.apache.org/jira/browse/ARROW-47. It's OK to leave this as is but marked experimental

Member

The changes in this file feel like a kludge to me. Let's not do anything else involving scalar outputs until resolving ARROW-47. I'll put up a patch for that tomorrow with any luck

Member

Agreed

Member

I'm skeptical about whether this is going to generate good code when we don't compile with -march=native (because we can't do this in many of the binaries we ship...). Compare with some analysis I did of optimizing "sum" in http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/ -- if we aren't coming within 5% of hand-optimized code with these abstractions, we need to go back to the drawing board



Member

wesm commented Jan 27, 2019

Here's the hand-tuned implementation I'm referring to

https://github.com/wesm/bitmaps-vs-sentinels/blob/master/benchmark.cc#L227

Member

wesm commented Jan 27, 2019

Another question that needs to be resolved is how to return a NULL value. What we do in pandas is have a "min_values" option, so:

  • sum of all nulls is 0 by default, but you can set min_values=1 if you want it to be NULL instead
  • mean of all nulls is NULL (min_values=1 by default)

Member

wesm commented Feb 9, 2019

@fsaintjacques the PR description is out of date, can you update? I'm going to review and leave comments so we can merge this and move on to the next steps

@wesm wesm force-pushed the ARROW-4124-aggregation-kernel branch from 78f9dbf to 9bd1f25 on February 9, 2019 21:30
Change-Id: Iacd7e6bdd6ce03cff3003ea1f8e703ba42c3ab7a
Member

@wesm wesm left a comment

Some minor comments for follow up patches. I'm going to tweak a few things and then merge this once the build is passing



# Aggregates
ADD_ARROW_TEST(sum-test PREFIX "arrow-compute")
ADD_ARROW_BENCHMARK(sum-benchmark PREFIX "arrow-compute")
Member

Let's plan to test and benchmark all the aggregates in one executable each (one for unit test, one for benchmark). So these names will have to change. Let's change them now so we don't have to move the file later

Member

done

Contributor Author

👍, such that we amortize the slow linking issue on Windows?

Member

among other things, yeah. I don't think we should have one test executable for each kernel, but rather organize tests according to logical relationships (e.g. "mean" is close to "sum")

static std::shared_ptr<ManagedAggregateState> Make(
    std::shared_ptr<AggregateFunction>& desc, MemoryPool* pool) {
  std::shared_ptr<Buffer> buf;
  if (!AllocateBuffer(pool, desc->Size(), &buf).ok()) return nullptr;
Member

This is a bad code smell

class AggregateFunction {
 public:
  /// \brief Consume an array into a state.
  virtual Status Consume(const Array& input, void* state) const = 0;
Member

Any reason not to accept ArrayData here and avoid unnecessary boxing / unboxing?

Contributor Author

The intent was to have strongly typed parameters and leave the error handling to the caller.

auto state = ManagedAggregateState::Make(aggregate_function_, ctx->memory_pool());
if (!state) return Status::OutOfMemory("AggregateState allocation failed");

auto array = input.make_array();
Member

See below comments re: this boxing step which could be avoided

ValidateSum<TypeParam>(&this->ctx_, "[]", Datum());

ValidateSum<TypeParam>(&this->ctx_, "[0, 1, 2, 3, 4, 5]",
                       Datum(Scalar(static_cast<SumType>(5 * 6 / 2))));
Member

Should have a test for the "all nulls" case. The output can be preliminary (i.e. returning 0)

if (input.null_count() > 0)
  *state = ConsumeSparse(array);
else
  *state = ConsumeDense(array);
Member

braces

Member

done

StateType local;

// TODO(fsaintjacques): This fails on slice not byte-aligned.
DCHECK_EQ(array.offset() % 8, 0);
Member

Create follow up JIRA?

const auto values = array.raw_values();
const auto bitmap = array.null_bitmap_data() + BitUtil::RoundDown(array.offset(), 8);
const auto length = array.length();
const auto length_rounded = BitUtil::RoundDown(length, 8);
Member

Should write down explicit types instead of const auto

// Returns 'value' rounded down to the nearest multiple of 'factor'
constexpr int64_t RoundDown(int64_t value, int64_t factor) {
  return (value / factor) * factor;
}
Member

Add unit tests for this

wesm added 3 commits February 9, 2019 18:16
Change-Id: I9a1f46f10c30d6f740c7896a2c72195013a714e7
Change-Id: I7f5d047191d58369f6e73650f3d705e967a2ba7b
Change-Id: Ib638796e4bb7c456625b8c29ba803f071051fd66
@wesm wesm changed the title ARROW-4124: [C++] Draft Aggregate and Sum kernels [WIP] ARROW-4124: [C++] Draft Aggregate and Sum kernels Feb 10, 2019
Member

wesm commented Feb 10, 2019

@fsaintjacques I tweaked the PR description. I'll merge this once the build passes, in the interest of project velocity. There are a lot of follow-up issues to create, so I'll leave that to you

@codecov-io

Codecov Report

Merging #3407 into master will increase coverage by 0.87%.
The diff coverage is 89.44%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3407      +/-   ##
==========================================
+ Coverage   87.75%   88.62%   +0.87%     
==========================================
  Files         667      542     -125     
  Lines       82541    73793    -8748     
  Branches     1069        0    -1069     
==========================================
- Hits        72434    65400    -7034     
+ Misses       9996     8393    -1603     
+ Partials      111        0     -111
Impacted Files Coverage Δ
cpp/src/arrow/compute/kernels/aggregate-test.cc 100% <100%> (ø)
cpp/src/arrow/compute/kernels/hash-test.cc 100% <100%> (ø) ⬆️
cpp/src/arrow/test-random.h 100% <100%> (ø) ⬆️
cpp/src/arrow/compute/kernels/aggregate.cc 100% <100%> (ø)
cpp/src/arrow/util/bit-util.h 100% <100%> (ø) ⬆️
cpp/src/arrow/compute/kernel.h 82.08% <55%> (-9.03%) ⬇️
cpp/src/arrow/compute/kernels/aggregate.h 87.5% <87.5%> (ø)
cpp/src/arrow/compute/kernels/sum.cc 87.8% <87.8%> (ø)
go/arrow/math/uint64_amd64.go
... and 135 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 35c3e5e...568ba09. Read the comment docs.

@wesm wesm closed this in 9a10bae Feb 10, 2019
Member

wesm commented Feb 10, 2019

The benchmark results on my machine for posterity:

---------------------------------------------------------------------------------------------------------------
Benchmark                                                        Time           CPU Iterations UserCounters...
---------------------------------------------------------------------------------------------------------------
BenchSum<SumNoNullsUnrolled<int64_t>>/32768/0                    1 us          1 us     833069 null_percent=0 size=32.768k   37.2508GB/s
BenchSum<SumNoNullsUnrolled<int64_t>>/32768/1                    1 us          1 us     838507 null_percent=1 size=32.768k   37.5707GB/s
BenchSum<SumNoNullsUnrolled<int64_t>>/32768/10                   1 us          1 us     833487 null_percent=10 size=32.768k   37.6234GB/s
BenchSum<SumNoNullsUnrolled<int64_t>>/32768/50                   1 us          1 us     691232 null_percent=50 size=32.768k   36.0022GB/s
BenchSum<SumNoNullsUnrolled<int64_t>>/262144/0                  12 us         12 us      57468 null_percent=0 size=262.144k   20.2227GB/s
BenchSum<SumNoNullsUnrolled<int64_t>>/262144/1                  12 us         12 us      57788 null_percent=1 size=262.144k     19.83GB/s
BenchSum<SumNoNullsUnrolled<int64_t>>/262144/10                 12 us         12 us      58093 null_percent=10 size=262.144k   20.1576GB/s
BenchSum<SumNoNullsUnrolled<int64_t>>/262144/50                 12 us         12 us      56866 null_percent=50 size=262.144k   20.0273GB/s
BenchSum<SumNoNullsUnrolled<int64_t>>/8388608/0                509 us        509 us       1421 null_percent=0 size=8.38861M   15.3606GB/s
BenchSum<SumNoNullsUnrolled<int64_t>>/8388608/1                504 us        504 us       1360 null_percent=1 size=8.38861M   15.5096GB/s
BenchSum<SumNoNullsUnrolled<int64_t>>/8388608/10               489 us        489 us       1425 null_percent=10 size=8.38861M   15.9891GB/s
BenchSum<SumNoNullsUnrolled<int64_t>>/8388608/50               517 us        517 us       1000 null_percent=50 size=8.38861M   15.1099GB/s
BenchSum<SumNoNullsUnrolled<int64_t>>/33554432/0              2336 us       2336 us        275 null_percent=0 size=33.5544M   13.3768GB/s
BenchSum<SumNoNullsUnrolled<int64_t>>/33554432/1              2344 us       2343 us        300 null_percent=1 size=33.5544M   13.3348GB/s
BenchSum<SumNoNullsUnrolled<int64_t>>/33554432/10             2337 us       2337 us        297 null_percent=10 size=33.5544M   13.3746GB/s
BenchSum<SumNoNullsUnrolled<int64_t>>/33554432/50             2404 us       2404 us        299 null_percent=50 size=33.5544M   13.0007GB/s
BenchSum<SumSentinelUnrolled<int64_t>>/32768/0                   3 us          3 us     218287 null_percent=0 size=32.768k   9.51592GB/s
BenchSum<SumSentinelUnrolled<int64_t>>/32768/1                   3 us          3 us     216732 null_percent=1 size=32.768k    9.4701GB/s
BenchSum<SumSentinelUnrolled<int64_t>>/32768/10                  3 us          3 us     218117 null_percent=10 size=32.768k   9.52378GB/s
BenchSum<SumSentinelUnrolled<int64_t>>/32768/50                  3 us          3 us     216833 null_percent=50 size=32.768k   9.45774GB/s
BenchSum<SumSentinelUnrolled<int64_t>>/262144/0                 26 us         26 us      27222 null_percent=0 size=262.144k   9.53919GB/s
BenchSum<SumSentinelUnrolled<int64_t>>/262144/1                 26 us         26 us      27233 null_percent=1 size=262.144k   9.52018GB/s
BenchSum<SumSentinelUnrolled<int64_t>>/262144/10                26 us         26 us      27106 null_percent=10 size=262.144k   9.52412GB/s
BenchSum<SumSentinelUnrolled<int64_t>>/262144/50                26 us         26 us      27039 null_percent=50 size=262.144k   9.52342GB/s
BenchSum<SumSentinelUnrolled<int64_t>>/8388608/0               866 us        866 us        802 null_percent=0 size=8.38861M   9.02228GB/s
BenchSum<SumSentinelUnrolled<int64_t>>/8388608/1               865 us        865 us        792 null_percent=1 size=8.38861M   9.03213GB/s
BenchSum<SumSentinelUnrolled<int64_t>>/8388608/10              865 us        865 us        808 null_percent=10 size=8.38861M   9.02746GB/s
BenchSum<SumSentinelUnrolled<int64_t>>/8388608/50              866 us        866 us        807 null_percent=50 size=8.38861M   9.02335GB/s
BenchSum<SumSentinelUnrolled<int64_t>>/33554432/0             3541 us       3541 us        198 null_percent=0 size=33.5544M   8.82504GB/s
BenchSum<SumSentinelUnrolled<int64_t>>/33554432/1             3541 us       3540 us        197 null_percent=1 size=33.5544M   8.82724GB/s
BenchSum<SumSentinelUnrolled<int64_t>>/33554432/10            3545 us       3545 us        198 null_percent=10 size=33.5544M   8.81579GB/s
BenchSum<SumSentinelUnrolled<int64_t>>/33554432/50            3543 us       3543 us        198 null_percent=50 size=33.5544M   8.81922GB/s
BenchSum<SumBitmapVectorizeUnroll<int64_t>>/32768/0              1 us          1 us     530767 null_percent=0 size=32.768k   23.1373GB/s
BenchSum<SumBitmapVectorizeUnroll<int64_t>>/32768/1              1 us          1 us     480515 null_percent=1 size=32.768k   20.9358GB/s
BenchSum<SumBitmapVectorizeUnroll<int64_t>>/32768/10             2 us          2 us     328397 null_percent=10 size=32.768k   14.2905GB/s
BenchSum<SumBitmapVectorizeUnroll<int64_t>>/32768/50             3 us          3 us     255490 null_percent=50 size=32.768k   11.1082GB/s
BenchSum<SumBitmapVectorizeUnroll<int64_t>>/262144/0            14 us         14 us      50120 null_percent=0 size=262.144k   17.4662GB/s
BenchSum<SumBitmapVectorizeUnroll<int64_t>>/262144/1            15 us         15 us      47530 null_percent=1 size=262.144k   16.6678GB/s
BenchSum<SumBitmapVectorizeUnroll<int64_t>>/262144/10           21 us         21 us      33159 null_percent=10 size=262.144k    11.662GB/s
BenchSum<SumBitmapVectorizeUnroll<int64_t>>/262144/50           23 us         23 us      30148 null_percent=50 size=262.144k   10.5367GB/s
BenchSum<SumBitmapVectorizeUnroll<int64_t>>/8388608/0          553 us        553 us       1149 null_percent=0 size=8.38861M   14.1303GB/s
BenchSum<SumBitmapVectorizeUnroll<int64_t>>/8388608/1          638 us        638 us       1101 null_percent=1 size=8.38861M   12.2365GB/s
BenchSum<SumBitmapVectorizeUnroll<int64_t>>/8388608/10        1148 us       1148 us        612 null_percent=10 size=8.38861M   6.80722GB/s
BenchSum<SumBitmapVectorizeUnroll<int64_t>>/8388608/50         780 us        780 us        893 null_percent=50 size=8.38861M   10.0146GB/s
BenchSum<SumBitmapVectorizeUnroll<int64_t>>/33554432/0        2513 us       2513 us        281 null_percent=0 size=33.5544M   12.4369GB/s
BenchSum<SumBitmapVectorizeUnroll<int64_t>>/33554432/1        2854 us       2854 us        246 null_percent=1 size=33.5544M   10.9505GB/s
BenchSum<SumBitmapVectorizeUnroll<int64_t>>/33554432/10       5169 us       5170 us        100 null_percent=10 size=33.5544M   6.04495GB/s
BenchSum<SumBitmapVectorizeUnroll<int64_t>>/33554432/50       3231 us       3231 us        207 null_percent=50 size=33.5544M   9.67302GB/s

The L1-cache-sized benchmarks seem like they can be a bit noisy

const auto length = array.length();
for (int64_t i = 0; i < length; i++) {
  // NaN is not equal to itself
  local.total += values[i] * Traits<T>::NotNull(values[i]);
Contributor Author

The previous version was faster, clang-6 and onward auto-vectorize this perfectly (even with just O3): https://godbolt.org/z/v5nFNM . The result would be even better with AVX512 enabled (see https://godbolt.org/z/LFcQJv and https://llvm.org/devmtg/2015-04/slides/MaskedIntrinsics.pdf).

wesm pushed a commit that referenced this pull request Mar 14, 2019
…ue frequencies

Picked up from: #1970

I mostly reused the unit tests as is and modified the rest based on feedback on that PR, adapting to new code. Some other changes I made:
1.  Remove the HashKernel and getter from the public API.  I think we can add these back once we have a better idea on how we are doing stateful kernels (i.e. #3407)
2. Add operator[] to NumericBuilder to modify previously added values (this might not be the right place, and I had a little bit of trouble figuring out how to integrate this into the existing TypedTest, so the testing is a little bit weak). This seemed better than using a vector to collect values.

If this approach looks OK for now, I'm also going to open up a JIRA to try to refactor the unit tests for Hash kernels (and maybe the headers), since I think there might be a clearer, more granular way of laying these out.

other things to discuss:
1.  Handling null values in Count/Unique (it looks like they are dropped today; should this be configurable/turned on?).
2.  Hashing edge cases for floating point numbers (just opened a JIRA on this).

Author: Micah Kornfield <emkornfield@gmail.com>
Author: Alessandro Andrioni <alessandroandrioni@gmail.com>

Closes #3579 from emkornfield/count_kernel and squashes the following commits:

9c55f7b <Micah Kornfield> make 64 bit
dd0d8a1 <Micah Kornfield> fix link and warning
72095eb <Micah Kornfield> Templatize whether to use return status
54afb2b <Micah Kornfield> change from std::copy to memcopy
2973fcc <Micah Kornfield> Address code review comments
e7e624b <Micah Kornfield> Add constants, per code review
4770d99 <Micah Kornfield> fix warning
c6f6ad7 <Micah Kornfield> address feedback
d99e52f <Micah Kornfield> add guard to CopyValue in cases where vector is empty
8c26b01 <Micah Kornfield> fix format
b7d5492 <Micah Kornfield> add null test
f964bd6 <Micah Kornfield> Rebase
e8e58a5 <Micah Kornfield> Address output type code review feedback
defb4f1 <Micah Kornfield> remove export from .cc
0152f2f <Micah Kornfield> plumb through status on hash visitors
afeb1ad <Micah Kornfield> add real jira
96858bd <Micah Kornfield> Use macro inversion to reduce boiler plate
0dd0077 <Micah Kornfield> minimal test
57349f7 <Micah Kornfield> unit tests passing
34834f7 <Alessandro Andrioni> First try at implementing a CountValues kernel
QuietCraftsmanship pushed a commit to QuietCraftsmanship/arrow that referenced this pull request Jul 7, 2025