[CH] Adaptive sort memory controll and support memory sort shuffle#5893
Merged
baibaichen merged 9 commits intoapache:mainfrom May 30, 2024
Merged
[CH] Adaptive sort memory controll and support memory sort shuffle#5893baibaichen merged 9 commits intoapache:mainfrom
baibaichen merged 9 commits intoapache:mainfrom
Conversation
|
Thanks for opening a pull request! Could you open an issue for this pull request on Github Issues? https://github.com/apache/incubator-gluten/issues Then could you also rename commit message and pull request title in the following format? See also: |
|
Run Gluten Clickhouse CI |
c32ce51 to
cf82dbf
Compare
|
Run Gluten Clickhouse CI |
1 similar comment
|
Run Gluten Clickhouse CI |
zhanglistar
reviewed
May 29, 2024
| if (!backend_conf_map.contains(CH_RUNTIME_SETTINGS_PREFIX + "prefer_external_sort_block_bytes")) | ||
| { | ||
| auto mem_gb = task_memory / static_cast<double>(1_GiB); | ||
| // 2.8x+5, Heuristics calculate the block size of external sort, [8,16] |
Contributor
There was a problem hiding this comment.
Just curious, how to get this formula?
Contributor
Author
There was a problem hiding this comment.
大概定了一下 1G 8M 2G 10M 3G 14M 4G 16M四个数据点然后做了一个线性回归,数据点是跟据测试效果大致选择的
|
Run Gluten Clickhouse CI |
a546bbe to
ca29456
Compare
|
Run Gluten Clickhouse CI |
|
Run Gluten Clickhouse CI |
ca29456 to
ec11bdc
Compare
|
Run Gluten Clickhouse CI |
Contributor
|
===== Performance report for TPCH SF2000 with Velox backend, for reference only ====
|
1 task
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
优化了排序阶段的spill机制,并且根据spark task的offheap内存配置自动调整相关参数,保证在高列数(>300)的情况下稳定运行。目前能够稳定运行 300列,300+GB高压缩比 parquet分区表导入任务,不需要额外配置参数。
实现了新的排序shuffle机制,降低了排序过程中的合并开销,同时在celeborn下不再需要数据落盘。
新增加shuffle算法的切换逻辑,当列数超过一定数量,或者分区数超过300时,切换为MemorySortShuffle
How was this patch tested?
unit tests
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)