-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-12728: [C++] Implement count_distinct/distinct hash aggregate kernels #10876
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Actually, if we want to keep the |
|
Changed to use Take() to assemble the final results instead of Concatenate(). |
bkietz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks good, just a few high level comments and nits
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's annoying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@michalursa what do you think of adding Grouper::Reserve(additional_capacity_hint)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The keys here are pairs of [value, group_id] so just having the number of groups doesn't give much of a capacity hint.
|
@lidavidm needs rebase |
|
Rebased, thanks for the heads up. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: (value, group_id) pairs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. Thanks, fixed.
This reuses the Grouper to add kernels for count_distinct and distinct. The implementation for distinct is very simple and not particularly efficient, leveraging Concatenate to build the output list. Neither kernel handles larger-than-memory processing.