Improve the reuse efficiency in dirty ecache and reduce page fault #2842
interwq merged 1 commit into jemalloc:dev from
Conversation
interwq
left a comment
Thanks for the patch and sharing the detailed investigation! It looks reasonable and can indeed allow more reuse after a lot of coalescing.
One question to confirm, does setting lg_extent_max_active_fit to 64 also show similar performance numbers? (not asking you to work around that way, but only for sanity checking that no other limiting factor there)
Hi @interwq ,
The PR looks good. Thanks @jiebinn! One last thing: can you please squash the two commits into one, and then force-push to the PR? Only the commit message of the first commit needs to be kept.
…ents in the dirty ecache has been limited. This patch was tested with real workloads using ClickHouse (Clickbench Q35) on a system with 2x240 vCPUs. The results showed a 2X improvement in queries per second (QPS) and a reduction in page faults to 29% of the previous rate. Additionally, a microbenchmark involving 256 memory reallocations resizing from 4KB to 16KB in one arena demonstrated a 5X performance improvement. Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>
@jiebinn Merged. Again, really appreciate all the details and effort. That was a great investigation plus solution.
The patch in jemalloc has been merged; it helps improve hot dirty memory reuse when using two-level hashtables. We have tested the patch with Clickbench Q35 on a 2 x 240 vCPUs system. Here is the result (opt/base):
- QPS: 1.96x
- VmRSS: 54.6%
- Page faults: 29%
- Cycles: 43%
- Instructions: 85.7%
- IPC: 1.99x
Refs: jemalloc/jemalloc#2842
Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>
Reviewed-by: Wangyang Guo <wangyang.guo@intel.com>
This patch updates the jemalloc submodule to the latest version. The new version includes performance improvements that enhance hot dirty memory reuse when using two-level hashtables. We have tested this patch with Clickbench Q35 on a system with 2 x 240 vCPUs. The results show significant performance gains (opt/base):
- QPS: 1.96x
- VmRSS: 54.6%
- Page faults: 29%
- Cycles: 43%
- Instructions: 85.7%
- IPC: 1.99x
The geometric mean of all 43 queries shows more than a 10% performance improvement.
Refs: jemalloc/jemalloc#2842
Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>
Reviewed-by: Wangyang Guo <wangyang.guo@intel.com>
Hi @interwq , I was wondering if there are any plans to release a new stable version of jemalloc. It's been three years since the last stable release, and we are eager to know if an update is on the horizon.
This patch helps to set lg_extent_max_active_fit to 8, which helps jemalloc to reuse existing dirty extents more efficiently when using two-level hash tables (256 sub-hashtables and reallocations). We have tested this patch with Clickbench Q35 on a system with 2 x 240 vCPUs. The results show significant performance gains (opt/base):
- QPS: 1.96x
- VmRSS: 54.6%
- Page faults: 29%
- Cycles: 43%
- Instructions: 85.7%
- IPC: 1.99x
The geometric mean of all 43 queries shows more than a 10% performance improvement.
Refs: jemalloc/jemalloc#2842
Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>
Reviewed-by: Tianyou Li <tianyou.li@intel.com>
Reviewed-by: Wangyang Guo <wangyang.guo@intel.com>
Reviewed-by: Zhiguo Zhou <zhiguo.zhou@intel.com>
Performance issue analysis:
We identified performance issues in many of the 43 ClickBench queries with ClickHouse on the 2x240 vCPUs system, particularly high page-fault rates. For a deeper investigation, consider Q35: it exhibited a __handle_mm_fault hotspot of around 40% of cycles on the GNR 2x240 vCPUs platform. Using perf events, we discovered that the high page-fault rate stems from memory previously released with MADV_DONTNEED.
In Query 35, there are 256 memory reallocations from 4KB to 16KB in each arena. When ClickHouse frees them, jemalloc recycles and coalesces many of these 256 16K memory blocks into larger ones. Subsequently, that large memory space in the dirty ecache cannot be reused for the next 16K request. When a new memory request is made, jemalloc locates an existing extent larger than the requested size in the dirty ecache and splits it to fit the actual request, maximizing memory reuse; however, there is a bound on the ratio between the existing extent size and the requested size, a maximum of 64 (so at most 16 KB * 64 = 1 MB for a 16K request), to minimize memory fragmentation. Once the coalesced extents exceed that bound, jemalloc has to find memory in the retained ecache (already MADV_DONTNEED'd and therefore lacking physical pages) or through mmap, resulting in page faults and high RSS.
What the patch does:
The idea is to reuse more dirty ecache extents in allocation/deallocation while still coalescing all possible memory pieces during purge/decay. There are two coalescing points for large extents (>= 16KB) in jemalloc: when an extent moves between ecache states (active -> dirty in our case), and when the background thread decays/purges an existing extent in the dirty/muzzy ecache (dirty -> muzzy/retained). Given the boundary opt_lg_extent_max_active_fit applied when requesting a new extent, we can apply the same boundary when coalescing extents as memory blocks return from ClickHouse to jemalloc, enabling most large extents to be reused by subsequent requests. This approach does not increase memory fragmentation, because no limit is applied at the decay/purge stage: there, jemalloc still merges extents as much as possible and moves them from dirty to muzzy/retained.
Result:
We tested the patch with Q35 of Clickbench with ClickHouse on a 2x240 vCPUs system. The performance improvement after applying this patch can be seen in the table below.
We have also developed a microbenchmark that performs 256 reallocations from 4KB to 16KB in each arena. Averaged over 1000 iterations, the time cost for 1 arena with the patch is only 19.3% of the baseline.
This pull request introduces a new max_size parameter to the extent coalescing logic in src/extent.c. The changes aim to improve memory management by limiting the maximum size of coalesced extents, particularly for large extents in the dirty ecache. The most important changes include adding the max_size parameter to the relevant functions, updating the coalescing logic to respect this limit, and documenting the rationale behind the new behavior.

Changes to extent coalescing logic:

Addition of the max_size parameter: extent_try_coalesce_impl and related functions (extent_try_coalesce and extent_try_coalesce_large) were updated to include a max_size parameter, which restricts the maximum size of coalesced extents. [1] [2]

Conditional checks for coalescing: the coalescing paths now check candidate extent sizes against max_size. This ensures that overly large extents are not created, improving memory reuse efficiency. [1] [2]

Documentation and rationale for the max_size behavior: comments document the max_size parameter for large extents in the dirty ecache, highlighting how this change improves dirty ecache reuse efficiency while maintaining flexibility during decay/purge operations.