Skip to content

Conversation

@cyb70289
Copy link
Contributor

No description provided.

@github-actions
Copy link

@cyb70289
Copy link
Contributor Author

cyb70289 commented Sep 25, 2020

Dev / Lint ci failure looks not related.

INFO:archery:Running Docker linter
apache-rat license violation: go/arrow/flight/Flight_grpc.pb.go
apache-rat license violation: go/arrow/flight/example_flight_server_test.go
Error: `docker-compose --file /home/runner/work/arrow/arrow/docker-compose.yml run --rm ubuntu-lint` exited with a non-zero exit code 1, see the process log above.

missing license header?
Will it related to pr #8175? @zeroshade @wesm

@kszucs
Copy link
Member

kszucs commented Sep 25, 2020

missing license header?

Yes, #8273 should fix that.

@jorisvandenbossche
Copy link
Member

We might want to have an option to specify the denominator? (whether it is n or n - 1, to compute population vs sample standard deviation) As some examples, numpy has a ddof keyword, postgres or clickhouse have separate functions for both, julia has a corrected keyword.

@cyb70289
Copy link
Contributor Author

We might want to have an option to specify the denominator? (whether it is n or n - 1, to compute population vs sample standard deviation)

Thank you, will do.

@cyb70289
Copy link
Contributor Author

Added option ddof to control stdev divisor. Same as numpy.std.

@nealrichardson
Copy link
Member

Two notes:

  1. Naming: I've never seen this called stdev anywhere. stddev is common, in numpy and julia it's std, in R it's sd. Let's go with one of those. Maybe just add an extra "d"?
  2. Since sd = sqrt(var) (https://github.com/apache/arrow/pull/8269/files#diff-461bd7e445c2a190f1173ebdefa21002R106), would it make sense to implement variance (i.e. most of this patch), and then standard deviation as the sqrt of that? That way we get two kernels (or even three, if sqrt is exposed as a kernel too).

@cyb70289
Copy link
Contributor Author

Thanks @nealrichardson

1. Naming: I've never seen this called `stdev` anywhere. `stddev` is common, in numpy and julia it's `std`, in R it's `sd`. Let's go with one of those. Maybe just add an extra "d"?

Naming is always the hardest thing :) Looks std is used more often, and it's short.
AFAIK, stdev is used in excel (the most popular statistic software I guess? :)

2. Since `sd = sqrt(var)` (https://github.com/apache/arrow/pull/8269/files#diff-461bd7e445c2a190f1173ebdefa21002R106), would it make sense to implement variance (i.e. most of this patch), and then standard deviation as the sqrt of that? That way we get two kernels (or even three, if sqrt is exposed as a kernel too).

I also thought about the var kernel. Will update this patch to include it.

@cyb70289 cyb70289 changed the title ARROW-10070: [C++][Compute] Implement stdev aggregate kernel ARROW-10070: [C++][Compute] Implement var and std aggregate kernel Sep 29, 2020
@cyb70289
Copy link
Contributor Author

Added variance kernel var. Renamed standard deviation kernel to std.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would rather "stddev" and "variance" respectively. "std" and "var" can easily be misunderstood, IMHO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ironically, the fact that this is random identically-distributed data means that the slices will have similar means and variances. Perhaps making the array smaller would make the test more significant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also test with chunks of very different sizes? For example [1, 2, 3, 4, 5, 6, 7] and [8].

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also test with ddof = 1? Though with such a large array size, the change may be rather small.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming suggestion: how about DdofOptions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we will add options later, e.g., to ignore NaN

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree there might be other options in the future. But with the renaming of the functions, maybe call it VarianceOptions instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kind of struggling as stddev kernel uses this same option.
Maybe export both VarianceOptions and StddevOptions and alias them as same type internally?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only one is useful. Just VarianceOptions IMHO.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thank you @cyb70289

@jorisvandenbossche
Copy link
Member

I suppose the behaviour with NaN is that any NaN in the input gives NaN as result? That might be worth adding a test for?

@pitrou
Copy link
Member

pitrou commented Sep 30, 2020

Since we're not doing anything special, it certainly will. I think that can be tackled in a separate JIRA, when adding a nan handling option perhaps.

@pitrou pitrou closed this in ffb6e28 Sep 30, 2020
@cyb70289 cyb70289 deleted the agg-stdev branch September 30, 2020 14:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants