Conversation

rok commented Mar 12, 2021

@nealrichardson what do you think about this approach? It introduces overhead because it transposes dictionary indices, but it gives us value_counts.
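
For concreteness, here is a minimal sketch of that unify-then-transpose idea, built on the public DictionaryUnifier, DictionaryArray::Transpose, and compute::Unique APIs. UniqueAcrossDictChunks is a hypothetical helper for illustration, not the PR's actual code, and it assumes every chunk is a DictionaryArray of the same dictionary type:

```cpp
#include <arrow/api.h>
#include <arrow/compute/api.h>

#include <memory>
#include <vector>

arrow::Result<std::shared_ptr<arrow::Array>> UniqueAcrossDictChunks(
    const std::shared_ptr<arrow::ChunkedArray>& chunked) {
  const auto& dict_type =
      static_cast<const arrow::DictionaryType&>(*chunked->type());
  ARROW_ASSIGN_OR_RAISE(auto unifier,
                        arrow::DictionaryUnifier::Make(dict_type.value_type()));

  // Pass 1: feed each chunk's dictionary to the unifier, keeping the
  // transpose map that remaps that chunk's indices to unified indices.
  std::vector<std::shared_ptr<arrow::Buffer>> transpose_maps;
  for (const auto& chunk : chunked->chunks()) {
    const auto& dict_chunk = static_cast<const arrow::DictionaryArray&>(*chunk);
    std::shared_ptr<arrow::Buffer> transpose;
    ARROW_RETURN_NOT_OK(unifier->Unify(*dict_chunk.dictionary(), &transpose));
    transpose_maps.push_back(std::move(transpose));
  }

  std::shared_ptr<arrow::DataType> unified_type;
  std::shared_ptr<arrow::Array> unified_dict;
  ARROW_RETURN_NOT_OK(unifier->GetResult(&unified_type, &unified_dict));

  // Pass 2: rewrite each chunk's indices against the unified dictionary.
  arrow::ArrayVector unified_chunks;
  for (size_t i = 0; i < chunked->chunks().size(); ++i) {
    const auto& dict_chunk = static_cast<const arrow::DictionaryArray&>(
        *chunked->chunk(static_cast<int>(i)));
    ARROW_ASSIGN_OR_RAISE(
        auto transposed,
        dict_chunk.Transpose(
            unified_type, unified_dict,
            reinterpret_cast<const int32_t*>(transpose_maps[i]->data())));
    unified_chunks.push_back(std::move(transposed));
  }

  // All chunks now share one dictionary, so a single unique pass suffices.
  auto unified = std::make_shared<arrow::ChunkedArray>(std::move(unified_chunks),
                                                       unified_type);
  return arrow::compute::Unique(unified);
}
```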

@nealrichardson nealrichardson requested a review from pitrou March 12, 2021 16:13
nealrichardson commented Mar 12, 2021

I'm not familiar with this C++ code so I'll let others comment (cc @pitrou @bkietz @michalursa). It looks like the issue is only with ChunkedArrays where the chunks have different dictionaries? My instinct is that, rather than unifying first and then determining unique values/counting/hashing, what if we could do the aggregation on each chunk first and then unify the results? That would be a smaller amount of data to manipulate.

@rok rok closed this Mar 12, 2021
@rok rok reopened this Mar 12, 2021

rok commented Mar 12, 2021

> My instinct is that, rather than unifying first and then determining unique values/counting/hashing, what if we could do the aggregation on each chunk first and then unify the results? That would be a smaller amount of data to manipulate.

Indeed, unifying over all chunks first and then transposing the individual chunks' indices would be a better approach!

I'm still a bit unfamiliar with the kernel mechanics, but I think implementing a new kernel for chunked DictionaryArrays with differing dictionaries will be the best way to go here.


pitrou commented Mar 16, 2021

There are indeed two possible approaches:

  • unify all chunks first, and then run the unique kernel over the transposed indices (as proposed by @rok)
  • run the unique kernel over the original chunks, and then hash-aggregate the unique results of the different chunks (in effect SELECT sum(counts) GROUP BY values; sketched after this comment)

The second approach could be faster in the (unusual?) cases where only a small subset of dictionary values actually appears in the data. If most dictionary values are used, though, both approaches should have similar performance.

Since we don't have a generic hash-aggregate yet, the first approach sounds good enough.
(also note that unique is in itself a special case of hash-aggregation)

cc @bkietz for opinions
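
To illustrate the second approach (the per-chunk aggregation @nealrichardson also suggested), here is a hedged sketch that runs ValueCounts over each chunk's indices and merges the partial counts in a hash map. ValueCountsByChunk is a hypothetical helper, and for brevity it assumes int32 indices, string dictionary values, and no nulls:

```cpp
#include <arrow/api.h>
#include <arrow/compute/api.h>

#include <string>
#include <unordered_map>

// Per-chunk aggregation: count dictionary indices within each chunk, then
// fold the (value, count) pairs into one map, i.e. SELECT sum(counts)
// GROUP BY values across chunks.
arrow::Result<std::unordered_map<std::string, int64_t>> ValueCountsByChunk(
    const std::shared_ptr<arrow::ChunkedArray>& chunked) {
  std::unordered_map<std::string, int64_t> totals;
  for (const auto& chunk : chunked->chunks()) {
    const auto& dict_chunk = static_cast<const arrow::DictionaryArray&>(*chunk);
    // Aggregate over this chunk's (small) index space only.
    ARROW_ASSIGN_OR_RAISE(auto counts_struct,
                          arrow::compute::ValueCounts(dict_chunk.indices()));
    // ValueCounts returns a StructArray with "values" and int64 "counts".
    auto indices = std::static_pointer_cast<arrow::Int32Array>(
        counts_struct->GetFieldByName("values"));
    auto counts = std::static_pointer_cast<arrow::Int64Array>(
        counts_struct->GetFieldByName("counts"));
    auto dict =
        std::static_pointer_cast<arrow::StringArray>(dict_chunk.dictionary());
    for (int64_t i = 0; i < indices->length(); ++i) {
      // Map each counted index back through this chunk's own dictionary.
      totals[dict->GetString(indices->Value(i))] += counts->Value(i);
    }
  }
  return totals;
}
```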


rok commented Mar 22, 2021

> Since we don't have a generic hash-aggregate yet, the first approach sounds good enough.
> (also note that unique is in itself a special case of hash-aggregation)

Shall I then fix the CI issues so we can proceed with the first approach?
Or would we rather put the effort into the generic hash-aggregate?


pitrou commented Mar 22, 2021

You can fix the CI issues IMHO.

@rok rok marked this pull request as ready for review March 25, 2021 16:19

rok commented Mar 25, 2021

This is ready for review; the Java CI issue appears to be a flaky upload.


rok commented Mar 29, 2021

ping :)

@rok rok force-pushed the ARROW-10403 branch 3 times, most recently from 293a6fc to 4d00295 on April 7, 2021 at 22:29
@pitrou pitrou changed the title from "ARROW-10403: [C++] Implement unique kernel for dictionary type" to "ARROW-10403: [C++] Implement unique kernel for non-uniform chunked dictionary arrays" Apr 8, 2021
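
For context, here is a sketch of the kind of input the retitled kernel targets: a chunked dictionary array whose chunks carry different dictionaries. It uses Arrow's ArrayFromJSON testing helper for brevity, and the exact output shape of Unique for dictionary inputs follows the kernel's contract, not this sketch:

```cpp
#include <arrow/api.h>
#include <arrow/compute/api.h>
#include <arrow/testing/gtest_util.h>  // ArrayFromJSON (test-only helper)

#include <iostream>

arrow::Status UniqueNonUniformDictExample() {
  auto type = arrow::dictionary(arrow::int32(), arrow::utf8());
  // Two chunks whose dictionaries differ: ["a", "b"] vs. ["b", "c"].
  ARROW_ASSIGN_OR_RAISE(
      auto chunk1,
      arrow::DictionaryArray::FromArrays(
          type, arrow::ArrayFromJSON(arrow::int32(), "[0, 1, 0]"),
          arrow::ArrayFromJSON(arrow::utf8(), R"(["a", "b"])")));
  ARROW_ASSIGN_OR_RAISE(
      auto chunk2,
      arrow::DictionaryArray::FromArrays(
          type, arrow::ArrayFromJSON(arrow::int32(), "[1, 0]"),
          arrow::ArrayFromJSON(arrow::utf8(), R"(["b", "c"])")));
  auto chunked = std::make_shared<arrow::ChunkedArray>(
      arrow::ArrayVector{chunk1, chunk2}, type);
  // With this PR, Unique unifies the dictionaries internally; the distinct
  // values here should be "a", "b", "c".
  ARROW_ASSIGN_OR_RAISE(auto uniques, arrow::compute::Unique(chunked));
  std::cout << uniques->ToString() << std::endl;
  return arrow::Status::OK();
}
```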
@pitrou pitrou left a comment

+1. Sorry for the delay, will merge if CI is green.

@pitrou pitrou closed this in b24cff9 Apr 8, 2021

rok commented Apr 8, 2021

Thanks @pitrou!
Should we open a JIRA for the generic hash-aggregate?


pitrou commented Apr 8, 2021

Yes, please do!
