Conversation

rok commented Mar 12, 2021

@nealrichardson what do you think about this approach? It introduces overhead because it transposes dictionary indices, but it gives us value_counts.
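
For concreteness, here is a minimal sketch of that unify-then-transpose idea, built on the public DictionaryUnifier, DictionaryArray::Transpose, and compute::Unique APIs. UniqueAcrossDictChunks is a hypothetical helper for illustration, not the PR's actual code, and it assumes every chunk is a DictionaryArray of the same dictionary type:

```cpp
#include <arrow/api.h>
#include <arrow/compute/api.h>

#include <memory>
#include <vector>

arrow::Result<std::shared_ptr<arrow::Array>> UniqueAcrossDictChunks(
    const std::shared_ptr<arrow::ChunkedArray>& chunked) {
  const auto& dict_type =
      static_cast<const arrow::DictionaryType&>(*chunked->type());
  ARROW_ASSIGN_OR_RAISE(auto unifier,
                        arrow::DictionaryUnifier::Make(dict_type.value_type()));

  // Pass 1: feed each chunk's dictionary to the unifier, keeping the
  // transpose map that remaps that chunk's indices to unified indices.
  std::vector<std::shared_ptr<arrow::Buffer>> transpose_maps;
  for (const auto& chunk : chunked->chunks()) {
    const auto& dict_chunk = static_cast<const arrow::DictionaryArray&>(*chunk);
    std::shared_ptr<arrow::Buffer> transpose;
    ARROW_RETURN_NOT_OK(unifier->Unify(*dict_chunk.dictionary(), &transpose));
    transpose_maps.push_back(std::move(transpose));
  }

  std::shared_ptr<arrow::DataType> unified_type;
  std::shared_ptr<arrow::Array> unified_dict;
  ARROW_RETURN_NOT_OK(unifier->GetResult(&unified_type, &unified_dict));

  // Pass 2: rewrite each chunk's indices against the unified dictionary.
  arrow::ArrayVector unified_chunks;
  for (size_t i = 0; i < chunked->chunks().size(); ++i) {
    const auto& dict_chunk = static_cast<const arrow::DictionaryArray&>(
        *chunked->chunk(static_cast<int>(i)));
    ARROW_ASSIGN_OR_RAISE(
        auto transposed,
        dict_chunk.Transpose(
            unified_type, unified_dict,
            reinterpret_cast<const int32_t*>(transpose_maps[i]->data())));
    unified_chunks.push_back(std::move(transposed));
  }

  // All chunks now share one dictionary, so a single unique pass suffices.
  auto unified = std::make_shared<arrow::ChunkedArray>(std::move(unified_chunks),
                                                       unified_type);
  return arrow::compute::Unique(unified);
}
```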

@nealrichardson nealrichardson requested a review from pitrou March 12, 2021 16:13
nealrichardson commented Mar 12, 2021

I'm not familiar with this C++ code so I'll let others comment (cc @pitrou @bkietz @michalursa). It looks like the issue is only with ChunkedArrays where the chunks have different dictionaries? My instinct is that, rather than unifying first and then determining unique values/counting/hashing, what if we could do the aggregation on each chunk first and then unify the results? That would be a smaller amount of data to manipulate.

@rok rok closed this Mar 12, 2021
@rok rok reopened this Mar 12, 2021

rok commented Mar 12, 2021

> My instinct is that, rather than unifying first and then determining unique values/counting/hashing, what if we could do the aggregation on each chunk first and then unify the results? That would be a smaller amount of data to manipulate.

Indeed, unifying over all chunks first and then transposing the individual chunks' indices would be a better approach!

I'm still a bit unfamiliar with the kernel mechanics, but I think implementing a new kernel for chunked DictionaryArrays with differing dictionaries will be the best way to go here.


pitrou commented Mar 16, 2021

There are indeed two possible approaches:

  • unify all chunks first, and then run the unique kernel over the transposed indices (as proposed by @rok)
  • run the unique kernel over the original chunks, and then hash-aggregate the unique results of the different chunks (in effect SELECT sum(counts) GROUP BY values; sketched after this comment)

The second approach could be faster in the (unusual?) cases where only a small subset of dictionary values actually appears in the data. If most dictionary values are used, though, both approaches should have similar performance.

Since we don't have a generic hash-aggregate yet, the first approach sounds good enough.
(also note that unique is in itself a special case of hash-aggregation)

cc @bkietz for opinions
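
To illustrate the second approach (the per-chunk aggregation @nealrichardson also suggested), here is a hedged sketch that runs ValueCounts over each chunk's indices and merges the partial counts in a hash map. ValueCountsByChunk is a hypothetical helper, and for brevity it assumes int32 indices, string dictionary values, and no nulls:

```cpp
#include <arrow/api.h>
#include <arrow/compute/api.h>

#include <string>
#include <unordered_map>

// Per-chunk aggregation: count dictionary indices within each chunk, then
// fold the (value, count) pairs into one map, i.e. SELECT sum(counts)
// GROUP BY values across chunks.
arrow::Result<std::unordered_map<std::string, int64_t>> ValueCountsByChunk(
    const std::shared_ptr<arrow::ChunkedArray>& chunked) {
  std::unordered_map<std::string, int64_t> totals;
  for (const auto& chunk : chunked->chunks()) {
    const auto& dict_chunk = static_cast<const arrow::DictionaryArray&>(*chunk);
    // Aggregate over this chunk's (small) index space only.
    ARROW_ASSIGN_OR_RAISE(auto counts_struct,
                          arrow::compute::ValueCounts(dict_chunk.indices()));
    // ValueCounts returns a StructArray with "values" and int64 "counts".
    auto indices = std::static_pointer_cast<arrow::Int32Array>(
        counts_struct->GetFieldByName("values"));
    auto counts = std::static_pointer_cast<arrow::Int64Array>(
        counts_struct->GetFieldByName("counts"));
    auto dict =
        std::static_pointer_cast<arrow::StringArray>(dict_chunk.dictionary());
    for (int64_t i = 0; i < indices->length(); ++i) {
      // Map each counted index back through this chunk's own dictionary.
      totals[dict->GetString(indices->Value(i))] += counts->Value(i);
    }
  }
  return totals;
}
```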


rok commented Mar 22, 2021

> Since we don't have a generic hash-aggregate yet, the first approach sounds good enough.
> (also note that unique is in itself a special case of hash-aggregation)

Shall I then fix the CI issues so we can proceed with the first approach?
Or would we rather put the effort into the generic hash-aggregate?


pitrou commented Mar 22, 2021

You can fix the CI issues IMHO.

@rok rok marked this pull request as ready for review March 25, 2021 16:19

rok commented Mar 25, 2021

This is ready for review; the Java CI issue appears to be a flaky upload.


rok commented Mar 29, 2021

ping :)

@rok rok force-pushed the ARROW-10403 branch 3 times, most recently from 293a6fc to 4d00295 on April 7, 2021 at 22:29
@pitrou pitrou changed the title from "ARROW-10403: [C++] Implement unique kernel for dictionary type" to "ARROW-10403: [C++] Implement unique kernel for non-uniform chunked dictionary arrays" Apr 8, 2021
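
For context, here is a sketch of the kind of input the retitled kernel targets: a chunked dictionary array whose chunks carry different dictionaries. It uses Arrow's ArrayFromJSON testing helper for brevity, and the exact output shape of Unique for dictionary inputs follows the kernel's contract, not this sketch:

```cpp
#include <arrow/api.h>
#include <arrow/compute/api.h>
#include <arrow/testing/gtest_util.h>  // ArrayFromJSON (test-only helper)

#include <iostream>

arrow::Status UniqueNonUniformDictExample() {
  auto type = arrow::dictionary(arrow::int32(), arrow::utf8());
  // Two chunks whose dictionaries differ: ["a", "b"] vs. ["b", "c"].
  ARROW_ASSIGN_OR_RAISE(
      auto chunk1,
      arrow::DictionaryArray::FromArrays(
          type, arrow::ArrayFromJSON(arrow::int32(), "[0, 1, 0]"),
          arrow::ArrayFromJSON(arrow::utf8(), R"(["a", "b"])")));
  ARROW_ASSIGN_OR_RAISE(
      auto chunk2,
      arrow::DictionaryArray::FromArrays(
          type, arrow::ArrayFromJSON(arrow::int32(), "[1, 0]"),
          arrow::ArrayFromJSON(arrow::utf8(), R"(["b", "c"])")));
  auto chunked = std::make_shared<arrow::ChunkedArray>(
      arrow::ArrayVector{chunk1, chunk2}, type);
  // With this PR, Unique unifies the dictionaries internally; the distinct
  // values here should be "a", "b", "c".
  ARROW_ASSIGN_OR_RAISE(auto uniques, arrow::compute::Unique(chunked));
  std::cout << uniques->ToString() << std::endl;
  return arrow::Status::OK();
}
```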
@pitrou pitrou left a comment

+1. Sorry for the delay, will merge if CI is green.

@pitrou pitrou closed this in b24cff9 Apr 8, 2021

rok commented Apr 8, 2021

Thanks @pitrou!
Should we open a JIRA for the generic hash-aggregate?


pitrou commented Apr 8, 2021

Yes, please do!
