
Improve performance and lower memory usage of GROUP BY with novel method.#10956

Closed
palasonic1 wants to merge 12 commits into ClickHouse:master from palasonic1:palasonic-draft-group-by

Conversation

@palasonic1
Contributor

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
placeholder

Detailed description / Documentation draft:
placeholder

@alexey-milovidov changed the title from "group by using shared method" to "Improve performance and lower memory usage of GROUP BY with novel method." on May 16, 2020
@blinkov added the doc-alert and pr-feature (Pull request with new product feature) labels on May 16, 2020
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Maxim Serebryakov does not appear to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
Have you already signed the CLA, but the status is still pending? Let us recheck it.

@alexey-milovidov
Member

@nickitat this is at least worth reading:
https://presentations.clickhouse.com/hse_2020/4th/GroupBySpeedup_pres.pdf
https://presentations.clickhouse.com/hse_2020/4th/GroupBySpeedup_full.pdf

@nickitat
I've read it, and the experiment itself looks interesting. Personally, I don't really believe that the fastest aggregation implementation would be a concurrent one (and we see that even on 32 threads the author had to double the number of buckets because of contention). It should be parallel (data-parallel). So IMO a more promising direction would be to implement a splitting aggregator as efficiently as possible: maybe vectorize the hash calculation, don't copy rows (only create a vector of indices for each partition), reuse the calculated hash, and maybe something else. This approach would do only one insertion into the hash table, have constant overhead per row, and have no scalability issues. It would also be reusable in DISTINCT, LIMIT BY, and maybe window functions.
wdyt?

@alexey-milovidov
Yes, I also think similarly. This approach is also harder to use with distributed aggregation.


Labels

pr-feature Pull request with new product feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants