[C++][Compute] Implement count_distinct/distinct hash aggregate kernels 

Implement a count distinct aggregate that internally reuses the hash table from hash group by.

This adds support for SQL queries such as:
`select a, count(distinct b), count(distinct c) from t group by a`

For instance, to compute count(distinct b), the first group id mapping assigns a group id based on the value of column a. A second group id mapping is then done inside the count(distinct b) aggregate using the key (groupid(a), b); count(distinct c) works the same way with the key (groupid(a), c).
After all input rows are consumed, the final processing step scans the hash table keyed on (groupid(a), b) and updates an array of counts indexed by groupid(a).
The resulting array of counts is the output of the count distinct aggregate.
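
The two-level mapping described above can be sketched in plain C++. This is only an illustration under stated assumptions, not the Arrow implementation: it uses `std::unordered_map` and `std::unordered_set` in place of Arrow's grouper and hash table, treats both columns as non-null `int64_t` values, and the names `CountDistinctPerGroup` and `PairHash` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical hash for the composite key (groupid(a), b).
struct PairHash {
  std::size_t operator()(const std::pair<uint32_t, int64_t>& key) const {
    std::size_t h1 = std::hash<uint32_t>{}(key.first);
    std::size_t h2 = std::hash<int64_t>{}(key.second);
    return h1 ^ (h2 + 0x9e3779b97f4a7c15ULL + (h1 << 6) + (h1 >> 2));
  }
};

// Computes count(distinct b) grouped by a. The result is indexed by
// groupid(a), where ids are assigned in order of first appearance of a.
// Assumption: no nulls; a real kernel would have to handle them.
std::vector<int64_t> CountDistinctPerGroup(const std::vector<int64_t>& a,
                                           const std::vector<int64_t>& b) {
  assert(a.size() == b.size());

  // First mapping: value of a -> dense group id.
  std::unordered_map<int64_t, uint32_t> group_ids;
  // Second mapping: the set of distinct (groupid(a), b) keys.
  std::unordered_set<std::pair<uint32_t, int64_t>, PairHash> seen;

  for (std::size_t i = 0; i < a.size(); ++i) {
    // emplace leaves an existing entry untouched, so the id stays stable.
    auto it = group_ids
                  .emplace(a[i], static_cast<uint32_t>(group_ids.size()))
                  .first;
    seen.emplace(it->second, b[i]);
  }

  // Final step: scan the (groupid(a), b) table and bump the count
  // for each key's group id.
  std::vector<int64_t> counts(group_ids.size(), 0);
  for (const auto& key : seen) {
    ++counts[key.first];
  }
  return counts;
}
```

In this sketch, calling `CountDistinctPerGroup({1, 1, 1, 2}, {10, 10, 11, 10})` yields two counts: the group for `a = 1` sees the distinct `b` values 10 and 11, and the group for `a = 2` sees only 10. A second aggregate such as count(distinct c) would keep its own `seen` table while sharing the first mapping.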

**Reporter**: [Michal Nowakiewicz](https://issues.apache.org/jira/browse/ARROW-12728) / @michalursa
**Assignee**: [David Li](https://issues.apache.org/jira/browse/ARROW-12728) / @lidavidm
**Watchers**: [Rok Mihevc](https://issues.apache.org/jira/browse/ARROW-12728) / @rok
#### Related issues:
- [[C++] Query engine umbrella issue](https://github.com/apache/arrow/issues/28385) (is a child of)
- [[C++] Implement hash_aggregate kernels (umbrella issue)](https://github.com/apache/arrow/issues/29014) (is a child of)
- [[C++][Compute] Implement non-hash count_distinct aggregate kernel](https://github.com/apache/arrow/issues/29633) (is related to)
- [[R] Binding for n_distinct()](https://github.com/apache/arrow/issues/29261) (is depended upon by)
#### PRs and other links:
- [GitHub Pull Request #10876](https://github.com/apache/arrow/pull/10876)

<sub>**Note**: *This issue was originally created as [ARROW-12728](https://issues.apache.org/jira/browse/ARROW-12728). Please see the [migration documentation](https://github.com/apache/arrow/issues/14542) for further details.*</sub>
