
Improve the reuse efficiency in dirty ecache and reduce page fault#2842

Merged
interwq merged 1 commit into jemalloc:dev from jiebinn:reuse
May 12, 2025

Conversation

@jiebinn
Contributor

@jiebinn jiebinn commented Apr 30, 2025

Performance issue analysis:
We identified performance issues, particularly high page-fault rates, in many of the 43 ClickBench queries when running ClickHouse on a 2x240 vCPU system. For a deeper investigation, consider Q35: __handle_mm_fault accounted for around 40% of cycles on the GNR 2x240 vCPU platform. Using perf events, we traced the high page-fault rate to memory returned to the kernel via MADV_DONTNEED.
In Query 35, each arena performs 256 memory reallocations from 4KB to 16KB. When ClickHouse frees these 256 16KB blocks, jemalloc recycles and coalesces many of them into one larger extent. That large extent in the dirty ecache then cannot be reused for subsequent 16KB requests, because the maximum extent size allowed before splitting is 16 * 64 KB. When a new memory request arrives, jemalloc looks for an existing extent in the dirty ecache larger than the requested size and splits it to fit, maximizing memory reuse; to limit fragmentation, however, the ratio between the existing extent size and the requested size is capped at 64. When no extent fits within this bound, jemalloc falls back to the retained ecache (already MADV_DONTNEEDed and lacking physical pages) or to mmap, resulting in page faults and high RSS.

What does the patch do:
The idea is to reuse more dirty-ecache extents during allocation/deallocation while still coalescing all possible memory pieces during purge/decay. There are two points at which jemalloc coalesces large extents (>= 16KB): when an extent moves between ecache states (active -> dirty in our case), and when the background thread decays/purges extents in the dirty/muzzy ecache (dirty -> muzzy/retained). Given the boundary opt_lg_extent_max_active_fit applied when serving a new extent request, we can apply the same boundary when coalescing extents that ClickHouse returns to jemalloc, so that most large extents remain reusable by subsequent requests. This does not increase memory fragmentation, because no limit is applied at the decay/purge stage: there jemalloc still merges extents as much as possible and moves them from dirty to muzzy/retained.

Result:
We tested the patch with Q35 of ClickBench running ClickHouse on a 2x240 vCPU system. The performance improvement after applying this patch is shown in the table below.

| Q35 | Queries per second | VmRSS | Page faults | Cycles | Instructions | IPC |
|---|---|---|---|---|---|---|
| Opt/Base | 196.10% | 54.60% | 29.00% | 43.00% | 85.70% | 199.30% |

We also developed a microbenchmark that performs 256 reallocations from 4KB to 16KB in each arena. Running 1000 iterations to average the results, with 1 arena the time cost with the patch is only 19.3% of the baseline.

| | Total time to reallocate from 4KB to 16KB |
|---|---|
| Opt/Base | 19.30% |
```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <jemalloc/jemalloc.h>

#define ARENA_COUNT 1
#define ALLOC_COUNT 256
#define INITIAL_SIZE (4 * 1024)
#define RESIZE_SIZE (16 * 1024)
#define ITERATIONS 1000

int main() {
    void **ptrs;
    unsigned *arena_ids;
    clock_t start, end;
    double total_time = 0.0;

    const char *version;
    size_t sz = sizeof(version);
    mallctl("version", &version, &sz, NULL, 0);
    printf("Using jemalloc version: %s\n", version);

    ptrs = malloc(ARENA_COUNT * ALLOC_COUNT * sizeof(void*));
    arena_ids = malloc(ARENA_COUNT * sizeof(unsigned));
    if (!ptrs || !arena_ids) {
        fprintf(stderr, "Failed to allocate pointer arrays\n");
        return 1;
    }

    /* Create one dedicated arena per group of allocations. */
    for (unsigned i = 0; i < ARENA_COUNT; i++) {
        unsigned arena_id;
        size_t arena_sz = sizeof(unsigned);
        if (mallctl("arenas.create", &arena_id, &arena_sz, NULL, 0) != 0) {
            fprintf(stderr, "Failed to create arena %u\n", i);
            return 1;
        }
        arena_ids[i] = arena_id;
    }

    for (int iter = 0; iter < ITERATIONS; iter++) {
        start = clock();

        /* Allocate 4KB blocks and touch every page. */
        for (unsigned i = 0; i < ARENA_COUNT; i++) {
            for (unsigned j = 0; j < ALLOC_COUNT; j++) {
                unsigned idx = i * ALLOC_COUNT + j;
                int flags = MALLOCX_ARENA(arena_ids[i]);
                ptrs[idx] = mallocx(INITIAL_SIZE, flags);
                if (!ptrs[idx]) {
                    fprintf(stderr, "Memory allocation failed\n");
                    return 1;
                }
                memset(ptrs[idx], 1, INITIAL_SIZE);
            }
        }

        /* Grow each block to 16KB and touch the new tail. */
        for (unsigned i = 0; i < ARENA_COUNT; i++) {
            for (unsigned j = 0; j < ALLOC_COUNT; j++) {
                unsigned idx = i * ALLOC_COUNT + j;
                int flags = MALLOCX_ARENA(arena_ids[i]);
                void *new_ptr = rallocx(ptrs[idx], RESIZE_SIZE, flags);
                if (!new_ptr) {
                    fprintf(stderr, "Memory resize failed\n");
                    return 1;
                }
                ptrs[idx] = new_ptr;
                memset((char*)ptrs[idx] + INITIAL_SIZE, 2, RESIZE_SIZE - INITIAL_SIZE);
            }
        }

        /* Free everything so the extents land in the dirty ecache. */
        for (unsigned i = 0; i < ARENA_COUNT; i++) {
            for (unsigned j = 0; j < ALLOC_COUNT; j++) {
                unsigned idx = i * ALLOC_COUNT + j;
                dallocx(ptrs[idx], 0);
            }
        }

        end = clock();
        double time_taken = ((double)(end - start)) / CLOCKS_PER_SEC;
        total_time += time_taken;
    }

    printf("Total time: %.6f seconds. Average time per loop: %.6f seconds\n", total_time, total_time / ITERATIONS);

    free(ptrs);
    free(arena_ids);

    return 0;
}
```

This pull request introduces a new max_size parameter to the extent coalescing logic in src/extent.c. The changes aim to improve memory management by limiting the maximum size of coalesced extents, particularly for large extents in the dirty ecache. The most important changes include adding the max_size parameter to relevant functions, updating the coalescing logic to respect this limit, and documenting the rationale behind the new behavior.

Changes to extent coalescing logic:

  • Addition of a max_size parameter: extent_try_coalesce_impl and related functions (extent_try_coalesce and extent_try_coalesce_large) were updated to take a max_size parameter, which restricts the maximum size of coalesced extents.
  • Conditional checks for coalescing: checks were added in both the forward and backward coalescing logic to skip merging extents if their combined size would exceed max_size. This ensures that overly large extents are not created, improving memory reuse efficiency.

Documentation and rationale:

  • Detailed comments on max_size behavior:
    • Added extensive comments explaining the purpose of the max_size parameter for large extents in the dirty ecache. The documentation highlights how this change improves dirty ecache reuse efficiency while maintaining flexibility during decay/purge operations.

Contributor

@interwq interwq left a comment


Thanks for the patch and sharing the detailed investigation! It looks reasonable and can indeed allow more reuse after a lot of coalescing.

One question to confirm, does setting lg_extent_max_active_fit to 64 also show similar performance numbers? (not asking you to work around that way, but only for sanity checking that no other limiting factor there)

@jiebinn
Contributor Author

jiebinn commented May 1, 2025

Hi @interwq,
Thank you for the kind and helpful suggestions. I have pushed another commit to address the overflow check and tidy up the code.
I also checked the performance with lg_extent_max_active_fit set to 64 (effectively no limit when choosing an extent before splitting); the numbers are very close to those of the optimization patch.
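For reference, assuming a standard jemalloc build that reads MALLOC_CONF at startup, this knob can be changed without recompiling (`./app` below is a placeholder for the actual binary):

```shell
# Sanity check described above: effectively lift the split boundary
# by raising lg_extent_max_active_fit for one run.
MALLOC_CONF="lg_extent_max_active_fit:64" ./app
```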

@interwq
Contributor

interwq commented May 8, 2025

The PR looks good. Thanks @jiebinn! One last thing: can you please squash the two commits into one and force-push to the PR? Only the commit message of the first commit needs to be kept.

…ents

in the dirty ecache has been limited. This patch was tested with real
workloads using ClickHouse (ClickBench Q35) on a system with 2x240 vCPUs.
The results showed a 2X improvement in queries per second (QPS) and
a reduction in page faults to 29% of the previous rate. Additionally,
a microbenchmark involving 256 memory reallocations resizing
from 4KB to 16KB in one arena demonstrated a 5X performance
improvement.

Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>
@jiebinn
Contributor Author

jiebinn commented May 9, 2025

The PR looks good. Thanks @jiebinn! One last thing: can you please squash the two commits into one and force-push to the PR? Only the commit message of the first commit needs to be kept.

The previous commits have been squashed into one and force pushed. Thanks @interwq!

@interwq interwq merged commit 3c14707 into jemalloc:dev May 12, 2025
19 of 20 checks passed
@interwq
Contributor

interwq commented May 12, 2025

@jiebinn Merged. Again, really appreciate all the details and effort. That was great investigation plus the solution.

jiebinn added a commit to jiebinn/ClickHouse that referenced this pull request May 15, 2025
The jemalloc patch that improves hot dirty-memory reuse when using
two-level hashtables has been merged. We have tested the patch with
ClickBench Q35 on a 2 x 240 vCPU system.
Here is the result (opt/base).

QPS: 1.96x
VmRSS: 54.6%
Page fault: 29%
cycles: 43%
instructions: 85.7%
IPC: 1.99x

Refs: jemalloc/jemalloc#2842

Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>
Reviewed-by: Wangyang Guo <wangyang.guo@intel.com>
jiebinn added a commit to jiebinn/ClickHouse that referenced this pull request May 15, 2025
This patch updates the jemalloc submodule to the latest version.
The new version includes performance improvements that enhance
hot dirty memory reuse when using two-level hashtables. We have
tested this patch with Clickbench Q35 on a system with 2 x 240
vCPUs. The results show significant performance gains (opt/base):

- QPS: 1.96x
- VmRSS: 54.6%
- Page faults: 29%
- Cycles: 43%
- Instructions: 85.7%
- IPC: 1.99x

The geometric mean of all 43 queries shows more than a 10%
performance improvement.

Refs: jemalloc/jemalloc#2842

Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>
Reviewed-by: Wangyang Guo <wangyang.guo@intel.com>
@jiebinn
Contributor Author

jiebinn commented May 16, 2025

Hi @interwq , I was wondering if there are any plans to release a new stable version of Jemalloc. It's been three years since the last stable release, and we are eager to know if an update is on the horizon.

jiebinn added a commit to jiebinn/ClickHouse that referenced this pull request May 21, 2025
This patch sets lg_extent_max_active_fit to 8, which helps
jemalloc reuse existing extents more efficiently when using
two-level hash tables (256 sub-hashtables plus reallocations).
We have tested this patch with ClickBench Q35 on a system with
2 x 240 vCPUs. The results show significant performance gains
(opt/base):

- QPS: 1.96x
- VmRSS: 54.6%
- Page faults: 29%
- Cycles: 43%
- Instructions: 85.7%
- IPC: 1.99x

The geometric mean of all 43 queries shows more than a 10%
performance improvement.

Refs: jemalloc/jemalloc#2842

Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>
Reviewed-by: Wangyang Guo <wangyang.guo@intel.com>
jiebinn added a commit to jiebinn/ClickHouse that referenced this pull request May 26, 2025
This patch sets lg_extent_max_active_fit to 8, which helps
jemalloc reuse existing dirty extents more efficiently when
using two-level hash tables (256 sub-hashtables plus
reallocations). We have tested this patch with ClickBench Q35
on a system with 2 x 240 vCPUs. The results show significant
performance gains (opt/base):

- QPS: 1.96x
- VmRSS: 54.6%
- Page faults: 29%
- Cycles: 43%
- Instructions: 85.7%
- IPC: 1.99x

The geometric mean of all 43 queries shows more than a 10%
performance improvement.

Refs: jemalloc/jemalloc#2842

Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>
Reviewed-by: Tianyou Li <tianyou.li@intel.com>
Reviewed-by: Wangyang Guo <wangyang.guo@intel.com>
Reviewed-by: Zhiguo Zhou <zhiguo.zhou@intel.com>