-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-1572: [C++] Implement "value counts" kernels for tabulating value frequencies #3579
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thanks for picking this up from where I left off, I'll close my original PR. |
|
@andrioni Thanks for the contribution |
62d23fd to
9a5c9ca
Compare
fsaintjacques
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some notes:
- May I propose to rename this to
Distribution? - We should also count Nulls.
- The current output format is 2 arrays. I think we should formalize this and return a single array of
List<Struct<Option<value>, count>>explicitly which should allow returning the count for the null value.
This can be done in another ticket/PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For completeness, you should set the datum type if we ever refactor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you clarify? I might be missing it but I don't see an API to set the type explicitly (instead it looks like it gets derived when value that is assigned in the underlying variant)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's handled by the variant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, my bad.
|
Rebased |
|
@fsaintjacques thanks for the review. My thoughts:
|
30dc7d5 to
8779edd
Compare
cpp/src/arrow/compute/kernels/hash.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this could potentially return a StructArray directly.
8779edd to
8c77e19
Compare
193d110 to
a6b9364
Compare
a24af0b to
123c6c0
Compare
|
@fsaintjacques I think I fixed the windows build (will check in the morning). I changed the output to be a Struct as suggested. I will file follow-up JIRAs for the rest. |
cpp/src/arrow/util/hashing.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd put a DCHECK_GE(start, 0)` just to make sure it's positive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also put a ASSERT_LE in and changed to std::copy which i think fixes another bug.
0176819 to
ae08e40
Compare
|
Clean build 👍 |
|
I want to give a last review, but I don't think anything is blocking. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make Values and Count a public constexpr such that user can programatically retrieve the field by name at compile time, maybe expose the index that way too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion. Done.
|
Ready to merge! |
|
I think you need to rebase for the failing appveyor failure, but I'd say this is safe to merge. LGTM +1. |
7825282 to
2845381
Compare
|
One thing that I'll do which I'm not sure will help too much is separate out the memcpy fix into it's own PR so we can isolate performance problems to the kernel |
|
I think there might also be a way forward with template magic to eliminate this altogether. I'll get an update out soon |
cdd6638 to
72095eb
Compare
|
With this change it looks like they are fairly close on my box (rerunning causes changes in different directions in different metrics) Raw text: |
|
Looking at this again and then merging, thanks for your patience |
|
I ran on a mobile Xeon 2.8 ghz with the cores pinned at 2.8ghz / no turbo boost and things looking OK here before after merging. thanks @emkornfield! |

Picked up from: #1970
I mostly reused the unit tests as is and modified the rest based on feedback on that PR and adapting to new code. Some other changes I mae:
If this approach looks OK for now. I'm also going to open up a JIRA to try refactor the unittest for Hash kernels (and maybe the headers) since I think there might be a clearer more granular way of laying these out.
other things to discuss: