Skip to content

Conversation

@fsaintjacques
Copy link
Contributor

Implements the mean (average) kernel aggregates on numeric columns. The final type is always a double. Refactored the Sum kernel implementation to share common parts, notably the consume part is identical. Only the Finalize and output type differ.

@fsaintjacques fsaintjacques force-pushed the ARROW-3121-mean-aggregate branch 2 times, most recently from 7f31a91 to 91d9b63 Compare February 22, 2019 02:21
@codecov-io
Copy link

codecov-io commented Feb 22, 2019

Codecov Report

Merging #3708 into master will increase coverage by 0.02%.
The diff coverage is 94.21%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3708      +/-   ##
==========================================
+ Coverage   87.65%   87.68%   +0.02%     
==========================================
  Files         701      703       +2     
  Lines       85747    85824      +77     
  Branches     1200     1200              
==========================================
+ Hits        75162    75254      +92     
+ Misses      10466    10456      -10     
+ Partials      119      114       -5
Impacted Files Coverage Δ
cpp/src/arrow/util/bit-util.h 100% <100%> (ø) ⬆️
cpp/src/arrow/util/bit-util-test.cc 98.9% <100%> (+0.01%) ⬆️
cpp/src/arrow/compute/kernels/aggregate-test.cc 100% <100%> (ø) ⬆️
cpp/src/arrow/compute/test-util.h 94.73% <100%> (+4.73%) ⬆️
cpp/src/arrow/compute/kernels/sum.cc 87.8% <66.66%> (-4.79%) ⬇️
cpp/src/arrow/compute/kernels/mean.cc 87.8% <87.8%> (ø)
cpp/src/arrow/compute/kernels/sum-internal.h 93.84% <93.84%> (ø)
go/arrow/math/uint64_sse4_amd64.go 0% <0%> (-100%) ⬇️
go/arrow/math/float64_sse4_amd64.go 0% <0%> (-100%) ⬇️
go/arrow/memory/memory_sse4_amd64.go 0% <0%> (-100%) ⬇️
... and 20 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9814605...2d4c2ba. Read the comment docs.

@fsaintjacques fsaintjacques force-pushed the ARROW-3121-mean-aggregate branch from 02a0bcf to 2d4c2ba Compare February 22, 2019 12:33
@fsaintjacques
Copy link
Contributor Author

I think the nodejs failure is unrelated and this is ready to merge.

@fsaintjacques
Copy link
Contributor Author

@bkietz @pitrou @emkornfield This is ready to review.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just one comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should assert false or abort on other datum kinds, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll also try to include this while implementing ARROW-3123.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😄

I think you want typename std::enable_if<!IsFloatingPoint<Type>::value>::type, though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ugh, shame on me! Clearly something is wrong then.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This (and ConsumeSparse, ConsumeTiny) looks pretty handy, maybe later we could refactor it into BitUtil::VisitBitmap() or so

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Soon the complexity of future needs will outgrow the current implementation, e.g. we'll need to mix with selection vectors/bitmap, groups, etc. I intend to open a discussion in the mailing list about providing iterator facilities with many options (batch, masked, groups).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

crazy edge case: if we're summing floating point and one of the null masked values happens to be NaN then multiplying by zero will not prevent the sum from also being NaN

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I'll see what I can do without too much change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the performance diff of doing ternary operator?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also would it pay to have a constant expression (or just static array) in bit utils that precomputes all the shifts similar to the popcount array below

@fsaintjacques fsaintjacques force-pushed the ARROW-3121-mean-aggregate branch from 2d4c2ba to 7929140 Compare February 26, 2019 02:40
Copy link
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I saw any blocking issues mostly style stuff and documentation suggestions (please do fix the NaN issue raised).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the performance diff of doing ternary operator?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@pitrou
Copy link
Member

pitrou commented Feb 26, 2019 via email

@fsaintjacques
Copy link
Contributor Author

@pitrou, the initial (sum) implementation used the visitor facility.

@fsaintjacques
Copy link
Contributor Author

This is good to go, the failure is from python https://issues.apache.org/jira/browse/ARROW-4684

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Datum(std::make_shared<ScalarType>(0.0)));

ValidateMean<TypeParam>(&this->ctx_, "[1, 1, 1, 1, 1, 1, 1, 1]",
Datum(std::make_shared<ScalarType>(1.0)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For self-documenting code you could do:

auto null = std::make_shared<ScalarType>(0.0, false);

Also keep in mind that Datum has implicit ctors, so these Datum(...) are unneeded

if (data_type == nullptr)
return Status::Invalid("Datum must be array-like");
else if (!is_integer(data_type->id()) && !is_floating(data_type->id()))
return Status::Invalid("Datum must contain a NumericType");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to always use braces

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants