ARROW-10070: [C++][Compute] Implement var and std aggregate kernel #8269

cyb70289 · 2020-09-25T04:29:25Z

No description provided.

github-actions · 2020-09-25T04:32:08Z

https://issues.apache.org/jira/browse/ARROW-10070

cyb70289 · 2020-09-25T06:47:16Z

The Dev / Lint CI failure looks unrelated.

INFO:archery:Running Docker linter
apache-rat license violation: go/arrow/flight/Flight_grpc.pb.go
apache-rat license violation: go/arrow/flight/example_flight_server_test.go
Error: `docker-compose --file /home/runner/work/arrow/arrow/docker-compose.yml run --rm ubuntu-lint` exited with a non-zero exit code 1, see the process log above.

Missing license header?
Could this be related to PR #8175? @zeroshade @wesm

kszucs · 2020-09-25T14:04:10Z

> missing license header?

Yes, #8273 should fix that.

jorisvandenbossche · 2020-09-26T07:37:30Z

We might want to have an option to specify the denominator? (whether it is n or n - 1, to compute population vs sample standard deviation) As some examples, numpy has a ddof keyword, postgres or clickhouse have separate functions for both, julia has a corrected keyword.

cyb70289 · 2020-09-26T15:22:19Z

> We might want to have an option to specify the denominator? (whether it is n or n - 1, to compute population vs sample standard deviation)

Thank you, will do.

cyb70289 · 2020-09-27T04:13:25Z

Added a `ddof` option to control the stdev divisor, same as numpy.std.
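
To make the divisor concrete, here is a minimal pure-Python sketch of variance with a numpy-style `ddof` parameter. It is not the Arrow kernel from this PR; the function name `variance` and the sample data are assumptions chosen for illustration.

```python
def variance(values, ddof=0):
    """Variance with divisor n - ddof (ddof=0: population, ddof=1: sample).

    Hypothetical helper for illustration, not the Arrow implementation.
    """
    n = len(values)
    if n - ddof <= 0:
        raise ValueError("need more than ddof values")
    mean = sum(values) / n
    # Sum of squared deviations from the mean.
    m2 = sum((x - mean) ** 2 for x in values)
    return m2 / (n - ddof)


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
population_var = variance(data, ddof=0)  # divides by n
sample_var = variance(data, ddof=1)      # divides by n - 1
```

With eight values, the two results differ by a factor of 8/7, which is why the choice of divisor matters most for short inputs.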

nealrichardson · 2020-09-28T15:26:37Z

Two notes:

1. Naming: I've never seen this called `stdev` anywhere. `stddev` is common, in numpy and julia it's `std`, in R it's `sd`. Let's go with one of those. Maybe just add an extra "d"?
2. Since `sd = sqrt(var)` (https://github.com/apache/arrow/pull/8269/files#diff-461bd7e445c2a190f1173ebdefa21002R106), would it make sense to implement variance (i.e. most of this patch), and then standard deviation as the sqrt of that? That way we get two kernels (or even three, if sqrt is exposed as a kernel too).
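
The relationship in the second note can be checked with a short sketch that uses only Python's standard `statistics` module, not Arrow. It derives the standard deviation as the square root of the variance; the variable names and data are assumptions for illustration.

```python
import math
import statistics

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

# Population (divisor n) and sample (divisor n - 1) variance.
pop_var = statistics.pvariance(data)
samp_var = statistics.variance(data)

# Standard deviation derived as the square root of the variance.
pop_std = math.sqrt(pop_var)
samp_std = math.sqrt(samp_var)
```

The derived values agree with `statistics.pstdev` and `statistics.stdev` up to floating-point rounding, so a variance kernel plus a square root is enough to obtain both.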

cyb70289 · 2020-09-29T02:00:13Z

Thanks @nealrichardson

> 1. Naming: I've never seen this called `stdev` anywhere. `stddev` is common, in numpy and julia it's `std`, in R it's `sd`. Let's go with one of those. Maybe just add an extra "d"?

Naming is always the hardest thing :) It looks like std is used more often, and it's short.
AFAIK, stdev is used in Excel (the most popular statistics software, I guess? :)

> 2. Since `sd = sqrt(var)` (https://github.com/apache/arrow/pull/8269/files#diff-461bd7e445c2a190f1173ebdefa21002R106), would it make sense to implement variance (i.e. most of this patch), and then standard deviation as the sqrt of that? That way we get two kernels (or even three, if sqrt is exposed as a kernel too).

I also thought about the var kernel. Will update this patch to include it.

cyb70289 · 2020-09-29T05:19:41Z

Added a variance kernel, `var`. Renamed the standard deviation kernel to `std`.

pitrou · 2020-09-29T12:24:45Z

docs/source/cpp/compute.rst

I would rather use "stddev" and "variance" respectively. "std" and "var" can easily be misunderstood, IMHO.

cpp/src/arrow/compute/kernels/aggregate_var_std.cc

pitrou · 2020-09-29T12:43:55Z

cpp/src/arrow/compute/kernels/aggregate_test.cc

Ironically, the fact that this is random identically-distributed data means that the slices will have similar means and variances. Perhaps making the array smaller would make the test more significant.

pitrou · 2020-09-29T12:44:53Z

cpp/src/arrow/compute/kernels/aggregate_test.cc

Can we also test with chunks of very different sizes? For example [1, 2, 3, 4, 5, 6, 7] and [8].
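
The uneven-chunk case can be sketched in pure Python. The thread does not say how the kernel combines per-chunk results, so the merge below is an assumption: it uses the standard pairwise formula for combining a count, a mean and a sum of squared deviations. The names `chunk_state` and `merge` are hypothetical.

```python
def chunk_state(values):
    """Return (count, mean, m2), where m2 is the sum of squared deviations."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values)
    return n, mean, m2


def merge(a, b):
    """Combine two chunk states without revisiting the underlying values."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    # The cross term accounts for the difference between the chunk means.
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


big = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
small = [8.0]

n, mean, m2 = merge(chunk_state(big), chunk_state(small))
merged_var = m2 / n  # population variance (ddof = 0)

# Reference: variance computed over the concatenated values.
whole_n, whole_mean, whole_m2 = chunk_state(big + small)
whole_var = whole_m2 / whole_n
```

A one-element chunk contributes zero to its own `m2`, so all of its effect arrives through the cross term, which makes this a useful edge case for a chunked test.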

pitrou · 2020-09-29T12:45:42Z

cpp/src/arrow/compute/kernels/aggregate_test.cc

pitrou · 2020-09-29T12:47:04Z

cpp/src/arrow/compute/kernels/aggregate_test.cc

Can we also test with ddof = 1? Though with such a large array size, the change may be rather small.

arw2019 · 2020-09-30T06:24:04Z

cpp/src/arrow/compute/api_aggregate.h

Naming suggestion: how about DdofOptions?

I'm not sure if we will add options later, e.g., to ignore NaN

I agree there might be other options in the future. But with the renaming of the functions, maybe call it VarianceOptions instead?

I'm kind of struggling, as the stddev kernel uses this same option.
Maybe export both VarianceOptions and StddevOptions and alias them to the same type internally?

Only one is useful. Just VarianceOptions IMHO.

pitrou

+1, thank you @cyb70289

jorisvandenbossche · 2020-09-30T11:57:04Z

I suppose the behaviour with NaN is that any NaN in the input gives NaN as result? That might be worth adding a test for?

pitrou · 2020-09-30T12:23:33Z

Since we're not doing anything special, it certainly will. I think that can be tackled in a separate JIRA, when adding a nan handling option perhaps.
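
As a rough illustration of why no special handling means NaN in, NaN out, the pure-Python sketch below (not the Arrow kernel; `variance` is a hypothetical helper) shows a single NaN propagating through a sum-based variance.

```python
import math


def variance(values, ddof=0):
    """Plain two-pass variance with no NaN handling (illustrative only)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - ddof)


# Any arithmetic involving NaN yields NaN, so the mean and every
# squared deviation become NaN, and so does the final result.
result = variance([1.0, 2.0, float("nan"), 4.0])
is_nan = math.isnan(result)
```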

cyb70289 changed the title ~~ARROW-10070: [C++][Compute] Implement stdev aggregate kernel~~ ARROW-10070: [C++][Compute] Implement var and std aggregate kernel Sep 29, 2020

pitrou reviewed Sep 29, 2020

pitrou reviewed Sep 29, 2020

arw2019 reviewed Sep 30, 2020

cyb70289 and others added 6 commits September 30, 2020 13:25

ARROW-10070: [C++][Compute] Implement stdev aggregate kernel (1610207)
Add options to control ddof (12f72a5)
Add variance kernel (43318f8)
Refine naming and test case (dbf7278)
Fix build error on windows (1957a2a)
VarianceOptions (1c77f16)

pitrou approved these changes Sep 30, 2020

pitrou closed this in ffb6e28 Sep 30, 2020

cyb70289 deleted the agg-stdev branch September 30, 2020 14:22

asfimport mentioned this pull request Sep 30, 2020

[C++][Compute] Implement stdev aggregate kernel #26089

Closed
