Improve memory reuse efficiency and reduce page faults when using two-level hash tables #80245
Conversation
Does this PR update to a development version of jemalloc?
Hi @rschu1ze, this PR will update to a dev version of jemalloc. There is a significant performance improvement when queries use two-level hash tables for aggregation. Each thread has 256 sub-hash tables per two-level hash table. For example, if the initial size of a sub-hash table is 4 KiB (256 cells * 16 bytes/cell) and it needs to grow from 4 KiB to 16 KiB (1024 cells * 16 bytes/cell) based on historical cache data, 256 reallocation requests (one per sub-hash table) from 4 KiB to 16 KiB are sent to jemalloc. After the aggregation, many of the 256 freed 16 KiB pieces coalesce into much larger blocks. However, jemalloc will not choose, split, and reuse these large memory blocks if their size is significantly larger than the subsequent requests. Consequently, new requests cannot reuse the already freed memory and have to be served with fresh pages instead.

To avoid this, if we don't update the jemalloc version, we might need to work around it by passing variables in settings to jemalloc or by changing how two-level hash tables manage their memory blocks. However, both approaches may introduce other concerns.

By the way, do we have any tests if we decide to update the submodule version? I also noticed that the latest jemalloc version in ClickHouse ([jemalloc @ 41a859e]) is on the dev branch.
@jiebinn First of all, thanks for the PR.
That's unfortunately a problem for ClickHouse, especially for something as fundamental as memory management. The jemalloc homepage says (not surprisingly): "dev: The dev branch tracks current development, and at any given time may not be production-ready." Unfortunately, they seem to make stable releases only rarely. This is only my opinion, though; I can't tell how reliable dev versions of jemalloc really are.
This is the exact version which ClickHouse is using: it is only ten (or so) commits ahead of tag
@rschu1ze, thank you for the quick response. I will ask the jemalloc maintainers about their plans for releasing a new stable version soon. If they don't have one scheduled, would you consider temporarily cherry-picking the performance commit?
@jiebinn Yes, but only if it has no dependencies on previous commits on the development branch and if it is straightforward. We should avoid using ClickHouse as a canary for finding bugs in jemalloc's dev branch. A stable jemalloc release would definitely be the preferred route.
I think using dev branches is OK, but it depends on the project's policy. If their dev branch is considered stable and final (think of absl, for example), then there is no issue. If their dev branch is a candidate which they will cut at some point and then fix bugs before release, then it's probably not OK. In general (in my experience) jemalloc is usually rock solid, so I expect it to be the first case. OTOH, running any branch in CI and comparing / detecting issues is fine. Based on the findings and results we can decide whether the risk is worth it or not.
@rschu1ze, I agree that if we cherry-pick one commit temporarily, the patch should be clear, simple, and of good quality. @Algunenano, I believe jemalloc is quite solid, and I agree that we can decide whether to update to a dev branch based on our findings and CI or other test results. I'll first ask jemalloc whether there are any plans to release a new stable version soon, as that would be the best solution. If there are no plans, we might consider cherry-picking this commit (jemalloc/jemalloc#2842), using the dev branch, or implementing a workaround in ClickHouse.
Okay, thanks. @jiebinn To let our CI test your PR, please fix the build issues ... once all builds are green, functional/perf/end-to-end tests will start. By the looks of things, it seems the build problems are in jemalloc itself. To fix them, feel free to replace the official jemalloc submodule ( Also, to fix the build, please open a build log (e.g. the one I linked above ^^), then grep for which you can run locally to reproduce (only remove trash like
Hi @rschu1ze and @Algunenano, we can use a safer and more convenient method to address the high page-fault issue when queries use two-level hash tables in ClickHouse. Previously, the default
@jiebinn it seems you removed the update of the submodule in one of the last force pushes, so the PR is not doing anything except changing
Yes. This PR will change the
@rschu1ze, it's OK to use the development branch. Our policy: ignore everything that library developers say about their releases and apply the ClickHouse CI to it.
I don't quite understand, sorry. Does this mean there will be two PRs? One (this one) with just changes to
```diff
 # MADV_DONTNEED. See
 # https://github.com/ClickHouse/ClickHouse/issues/11121 for motivation.
-set (JEMALLOC_CONFIG_MALLOC_CONF "percpu_arena:percpu,oversize_threshold:0,muzzy_decay_ms:0,dirty_decay_ms:5000,prof:true,prof_active:false,background_thread:true")
+set (JEMALLOC_CONFIG_MALLOC_CONF "percpu_arena:percpu,oversize_threshold:0,muzzy_decay_ms:0,dirty_decay_ms:5000,prof:true,prof_active:false,background_thread:true,lg_extent_max_active_fit:8")
```
Please consider leaving a comment explaining the change to lg_extent_max_active_fit. Also, do we need it for non-Linux?
Thanks. I will add a comment to explain the reason. This optimization applies to both Linux and non-Linux systems. However, I have not tested it on non-Linux systems due to lack of access to such environments.
@Algunenano Maybe we should consider applying the change to both Linux and non-Linux systems. I will check the CI results.
We don't test on non-Linux systems, but I'd apply it for consistency
Hi @Algunenano, we only need to keep this PR to change
This patch sets lg_extent_max_active_fit to 8, which helps jemalloc reuse existing dirty extents more efficiently when using two-level hash tables (256 sub-hashtables and reallocations).

We have tested this patch with ClickBench Q35 on a system with 2 x 240 vCPUs. The results show significant performance gains (opt/base):

- QPS: 1.96x
- VmRSS: 54.6%
- Page faults: 29%
- Cycles: 43%
- Instructions: 85.7%
- IPC: 1.99x

The geometric mean of all 43 queries shows more than a 10% performance improvement.

Refs: jemalloc/jemalloc#2842

Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>
Reviewed-by: Tianyou Li <tianyou.li@intel.com>
Reviewed-by: Wangyang Guo <wangyang.guo@intel.com>
Reviewed-by: Zhiguo Zhou <zhiguo.zhou@intel.com>
@jiebinn I'm trying to reproduce the improvements in the description, but I don't see much of any change:
Testing with a
Master: localhost:49000, queries: 50, QPS: 1.695, RPS: 169451896.802, MiB/s: 646.408, result RPS: 16.946, result MiB/s: 0.001.

I'm waiting for the perf report to run again, but if you could explain how/what you are measuring, that would be great. Does this problem only appear when running with a large number of threads (

I see this in the build: So I'm assuming it's applied correctly.
Hi @Algunenano, the issue this PR addresses is not related to the system's core count. If each sub-hashtable in the two-level hash tables falls within the size range of jemalloc-defined large extents (16KB to
No noticeable changes in the performance tests; I only see it in the microbenchmarks. Still, it seems OK to include and analyze in larger perf tests as part of the release
9fb0fb1
…s to change jemalloc conf (#57076) To reduce page faults, like ClickHouse does. Related PR: ClickHouse/ClickHouse#80245
This patch will change the default `lg_extent_max_active_fit` from 6 to 8. It will enhance hot dirty memory reuse when using two-level hashtables.

**Performance issue analysis:**

We identified the performance issue in many of the 43 ClickBench queries with ClickHouse on the 2 x 240 vCPUs system, particularly high page faults. For a deeper investigation, let's consider Q35. Q35 exhibited an around 40% `__handle_mm_fault` hotspot in cycles on the GNR 2 x 240 vCPUs platform. With the `perf` event data, we discovered that the high page faults stem from `MADV_DONTNEED`.

In Query 35, there are 256 memory reallocations (sub-hashtables) from 4KB to 16KB for each arena. According to the `bpftrace` data, jemalloc recycles and coalesces many of these 256 16K memory blocks into a larger one when ClickHouse frees them. Subsequently, the large memory space in the dirty ecache cannot be reused for the next 16K request, as the maximum allowed extent size before splitting is 64 * 16 KB. When a new memory request is made, jemalloc locates an existing extent larger than the requested size in the dirty ecache and splits it to fit the actual requested size, maximizing memory reuse. There is a bound on the ratio of the existing extent size to the requested extent size, with a maximum ratio of 64, to minimize memory fragmentation. Consequently, jemalloc has to find memory in the retained ecache (already `MADV_DONTNEED`ed and lacking physical pages) or through mmap, resulting in page faults and high RSS.

**What does the jemalloc patch do:**

The 256 pieces of 16K dirty memory coalesce into a memory block larger than 64 * 16K, preventing reuse when a new 16K request arrives. To maximize dirty ecache reuse, we increase the maximum ratio of the existing extent size to the requested extent size from 64 to 256.
Ref:
jemalloc/jemalloc#2842
Result:
We have tested this patch with ClickBench Q35 on a system with 2 x 240 vCPUs. The results show significant performance gains (opt/base):

- QPS: 1.96x
- VmRSS: 54.6%
- Page faults: 29%
- Cycles: 43%
- Instructions: 85.7%
- IPC: 1.99x
The geometric mean of all 43 queries shows more than a 10% performance improvement.
Refs: jemalloc/jemalloc#2842
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Improve memory reuse efficiency and reduce page faults when using two-level hash tables.
Documentation entry for user-facing changes