-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-13764: [C++] Support CountOptions in grouped count distinct #11011
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I'm still digging into this, but I noticed the unit tests are especially slow in debug mode. This is because we make a 2 GiB size allocation in GrouperFastImpl::GetUniques, which also makes me worry that these tests might use too much memory for CI. |
|
Also, filtering is not the right tradeoff to make: I think we don't care too much about the ONLY_NULL case and in the other cases, counting nulls is not that big a deal. So I'll update this to do the filtering in Finalize(). |
|
Changed to filter the results after consumption. For whatever reason, not filtering before feeding batches into the grouper doesn't cause the runtime/memory blowup. |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you. A couple comments, but looks mostly good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
values->null_bitmap() may be null here, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps add a test case where none of the values is null?
docs/source/cpp/compute.rst
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also add that the setting doesn't apply to the grouping keys?
Really? Why does this happen? |
I was trying to figure out why, but didn't get very far in untangling the internals before I decided that filtering before grouping was pointless anyways. This doesn't happen the current implementation, I might poke at this further later though. |
|
Thanks for the suggestions. I believe I've addressed everything. |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, thank you!
This works by filtering the values (trading off time for memory usage).