
Improve the reuse efficiency in dirty ecache and reduce page fault#2842

Merged
interwq merged 1 commit into jemalloc:dev from jiebinn:reuse
May 12, 2025

Conversation

@jiebinn
Contributor

@jiebinn jiebinn commented Apr 30, 2025

Performance issue analysis:
We identified performance issues, particularly high page-fault rates, in many of the 43 ClickBench queries when running ClickHouse on a 2x240 vCPU system. For a deeper investigation, consider Q35: __handle_mm_fault accounted for around 40% of cycles on the GNR 2x240 vCPU platform. Using perf events, we traced the high page-fault rate to memory returned to the kernel via MADV_DONTNEED.
In Query 35, each arena performs 256 memory reallocations from 4KB to 16KB. When ClickHouse frees these 256 16KB blocks, jemalloc recycles and coalesces many of them into one larger extent. That large extent in the dirty ecache then cannot be reused for subsequent 16KB requests, because the maximum extent size allowed before splitting is 16 * 64 KB. When a new memory request arrives, jemalloc looks for an existing extent in the dirty ecache larger than the requested size and splits it to fit, maximizing memory reuse; to limit fragmentation, however, the ratio between the existing extent size and the requested size is capped at 64. When no extent fits within this bound, jemalloc falls back to the retained ecache (already MADV_DONTNEEDed and lacking physical pages) or to mmap, resulting in page faults and high RSS.

What does the patch do:
The idea is to reuse more dirty-ecache extents during allocation/deallocation while still coalescing all possible memory pieces during purge/decay. There are two points at which jemalloc coalesces large extents (>= 16KB): when an extent moves between ecache states (active -> dirty in our case), and when the background thread decays/purges extents in the dirty/muzzy ecache (dirty -> muzzy/retained). Given the boundary opt_lg_extent_max_active_fit applied when serving a new extent request, we can apply the same boundary when coalescing extents that ClickHouse returns to jemalloc, so that most large extents remain reusable by subsequent requests. This does not increase memory fragmentation, because no limit is applied at the decay/purge stage: there jemalloc still merges extents as much as possible and moves them from dirty to muzzy/retained.

Result:
We tested the patch with Q35 of ClickBench running ClickHouse on a 2x240 vCPU system. The performance improvement after applying this patch is shown in the table below.

| Q35 | Queries per second | VmRSS | Page faults | Cycles | Instructions | IPC |
|---|---|---|---|---|---|---|
| Opt/Base | 196.10% | 54.60% | 29.00% | 43.00% | 85.70% | 199.30% |

We also developed a microbenchmark that performs 256 reallocations from 4KB to 16KB in each arena. Running 1000 iterations to average the results, with 1 arena the time cost with the patch is only 19.3% of the baseline.

| | Total time to reallocate from 4KB to 16KB |
|---|---|
| Opt/Base | 19.30% |
```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <jemalloc/jemalloc.h>

#define ARENA_COUNT 1
#define ALLOC_COUNT 256
#define INITIAL_SIZE (4 * 1024)
#define RESIZE_SIZE (16 * 1024)
#define ITERATIONS 1000

int main() {
    void **ptrs;
    unsigned *arena_ids;
    clock_t start, end;
    double total_time = 0.0;

    const char *version;
    size_t sz = sizeof(version);
    mallctl("version", &version, &sz, NULL, 0);
    printf("Using jemalloc version: %s\n", version);

    ptrs = malloc(ARENA_COUNT * ALLOC_COUNT * sizeof(void*));
    arena_ids = malloc(ARENA_COUNT * sizeof(unsigned));
    if (!ptrs || !arena_ids) {
        fprintf(stderr, "Failed to allocate pointer arrays\n");
        return 1;
    }

    /* Create one dedicated arena per group of allocations. */
    for (unsigned i = 0; i < ARENA_COUNT; i++) {
        unsigned arena_id;
        size_t arena_sz = sizeof(unsigned);
        if (mallctl("arenas.create", &arena_id, &arena_sz, NULL, 0) != 0) {
            fprintf(stderr, "Failed to create arena %u\n", i);
            return 1;
        }
        arena_ids[i] = arena_id;
    }

    for (int iter = 0; iter < ITERATIONS; iter++) {
        start = clock();

        /* Allocate 4KB blocks and touch every page. */
        for (unsigned i = 0; i < ARENA_COUNT; i++) {
            for (unsigned j = 0; j < ALLOC_COUNT; j++) {
                unsigned idx = i * ALLOC_COUNT + j;
                int flags = MALLOCX_ARENA(arena_ids[i]);
                ptrs[idx] = mallocx(INITIAL_SIZE, flags);
                if (!ptrs[idx]) {
                    fprintf(stderr, "Memory allocation failed\n");
                    return 1;
                }
                memset(ptrs[idx], 1, INITIAL_SIZE);
            }
        }

        /* Grow each block to 16KB and touch the new tail. */
        for (unsigned i = 0; i < ARENA_COUNT; i++) {
            for (unsigned j = 0; j < ALLOC_COUNT; j++) {
                unsigned idx = i * ALLOC_COUNT + j;
                int flags = MALLOCX_ARENA(arena_ids[i]);
                void *new_ptr = rallocx(ptrs[idx], RESIZE_SIZE, flags);
                if (!new_ptr) {
                    fprintf(stderr, "Memory resize failed\n");
                    return 1;
                }
                ptrs[idx] = new_ptr;
                memset((char*)ptrs[idx] + INITIAL_SIZE, 2, RESIZE_SIZE - INITIAL_SIZE);
            }
        }

        /* Free everything so the extents land in the dirty ecache. */
        for (unsigned i = 0; i < ARENA_COUNT; i++) {
            for (unsigned j = 0; j < ALLOC_COUNT; j++) {
                unsigned idx = i * ALLOC_COUNT + j;
                dallocx(ptrs[idx], 0);
            }
        }

        end = clock();
        double time_taken = ((double)(end - start)) / CLOCKS_PER_SEC;
        total_time += time_taken;
    }

    printf("Total time: %.6f seconds. Average time per loop: %.6f seconds\n", total_time, total_time / ITERATIONS);

    free(ptrs);
    free(arena_ids);

    return 0;
}
```

This pull request introduces a new max_size parameter to the extent coalescing logic in src/extent.c. The changes aim to improve memory management by limiting the maximum size of coalesced extents, particularly for large extents in the dirty ecache. The most important changes include adding the max_size parameter to relevant functions, updating the coalescing logic to respect this limit, and documenting the rationale behind the new behavior.

Changes to extent coalescing logic:

  • Addition of a max_size parameter: extent_try_coalesce_impl and related functions (extent_try_coalesce and extent_try_coalesce_large) were updated to take a max_size parameter, which restricts the maximum size of coalesced extents.
  • Conditional checks for coalescing: checks were added in both the forward and backward coalescing logic to skip merging extents if their combined size would exceed max_size. This ensures that overly large extents are not created, improving memory reuse efficiency.

Documentation and rationale:

  • Detailed comments on max_size behavior:
    • Added extensive comments explaining the purpose of the max_size parameter for large extents in the dirty ecache. The documentation highlights how this change improves dirty ecache reuse efficiency while maintaining flexibility during decay/purge operations.

Contributor

@interwq interwq left a comment


Thanks for the patch and sharing the detailed investigation! It looks reasonable and can indeed allow more reuse after a lot of coalescing.

One question to confirm, does setting lg_extent_max_active_fit to 64 also show similar performance numbers? (not asking you to work around that way, but only for sanity checking that no other limiting factor there)

@jiebinn
Contributor Author

jiebinn commented May 1, 2025

Hi @interwq,
Thank you for the kind and helpful suggestions. I have pushed another commit to address the overflow check and tidy up the code.
I also checked the performance with lg_extent_max_active_fit set to 64 (effectively no limit when choosing an extent before splitting); the numbers are very close to those of the optimization patch.
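For reference, assuming a standard jemalloc build that reads MALLOC_CONF at startup, this knob can be changed without recompiling (`./app` below is a placeholder for the actual binary):

```shell
# Sanity check described above: effectively lift the split boundary
# by raising lg_extent_max_active_fit for one run.
MALLOC_CONF="lg_extent_max_active_fit:64" ./app
```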

@interwq
Contributor

interwq commented May 8, 2025

The PR looks good. Thanks @jiebinn! One last thing: can you please squash the two commits into one and force-push to the PR? Only the commit message of the first commit needs to be kept.

…ents

in the dirty ecache has been limited. This patch was tested with real
workloads using ClickHouse (ClickBench Q35) on a system with 2x240 vCPUs.
The results showed a 2X improvement in queries per second (QPS) and
a reduction in page faults to 29% of the previous rate. Additionally,
a microbenchmark involving 256 memory reallocations resizing
from 4KB to 16KB in one arena demonstrated a 5X performance
improvement.

Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>
@jiebinn
Contributor Author

jiebinn commented May 9, 2025

The PR looks good. Thanks @jiebinn! One last thing: can you please squash the two commits into one and force-push to the PR? Only the commit message of the first commit needs to be kept.

The previous commits have been squashed into one and force pushed. Thanks @interwq!

@interwq interwq merged commit 3c14707 into jemalloc:dev May 12, 2025
19 of 20 checks passed
@interwq
Contributor

interwq commented May 12, 2025

@jiebinn Merged. Again, really appreciate all the details and effort. That was great investigation plus the solution.

jiebinn added a commit to jiebinn/ClickHouse that referenced this pull request May 15, 2025
The jemalloc patch that improves hot dirty-memory reuse when using
two-level hashtables has been merged. We have tested the patch with
ClickBench Q35 on a 2 x 240 vCPU system.
Here is the result (opt/base).

QPS: 1.96x
VmRSS: 54.6%
Page fault: 29%
cycles: 43%
instructions: 85.7%
IPC: 1.99x

Refs: jemalloc/jemalloc#2842

Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>
Reviewed-by: Wangyang Guo <wangyang.guo@intel.com>
jiebinn added a commit to jiebinn/ClickHouse that referenced this pull request May 15, 2025
This patch updates the jemalloc submodule to the latest version.
The new version includes performance improvements that enhance
hot dirty memory reuse when using two-level hashtables. We have
tested this patch with Clickbench Q35 on a system with 2 x 240
vCPUs. The results show significant performance gains (opt/base):

- QPS: 1.96x
- VmRSS: 54.6%
- Page faults: 29%
- Cycles: 43%
- Instructions: 85.7%
- IPC: 1.99x

The geometric mean of all 43 queries shows more than a 10%
performance improvement.

Refs: jemalloc/jemalloc#2842

Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>
Reviewed-by: Wangyang Guo <wangyang.guo@intel.com>
@jiebinn
Contributor Author

jiebinn commented May 16, 2025

Hi @interwq , I was wondering if there are any plans to release a new stable version of Jemalloc. It's been three years since the last stable release, and we are eager to know if an update is on the horizon.

jiebinn added a commit to jiebinn/ClickHouse that referenced this pull request May 21, 2025
This patch sets lg_extent_max_active_fit to 8, which helps
jemalloc reuse existing extents more efficiently when using
two-level hash tables (256 sub-hashtables plus reallocations).
We have tested this patch with ClickBench Q35 on a system with
2 x 240 vCPUs. The results show significant performance gains
(opt/base):

- QPS: 1.96x
- VmRSS: 54.6%
- Page faults: 29%
- Cycles: 43%
- Instructions: 85.7%
- IPC: 1.99x

The geometric mean of all 43 queries shows more than a 10%
performance improvement.

Refs: jemalloc/jemalloc#2842

Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>
Reviewed-by: Wangyang Guo <wangyang.guo@intel.com>
jiebinn added a commit to jiebinn/ClickHouse that referenced this pull request May 26, 2025
This patch sets lg_extent_max_active_fit to 8, which helps
jemalloc reuse existing dirty extents more efficiently when
using two-level hash tables (256 sub-hashtables plus
reallocations). We have tested this patch with ClickBench Q35
on a system with 2 x 240 vCPUs. The results show significant
performance gains (opt/base):

- QPS: 1.96x
- VmRSS: 54.6%
- Page faults: 29%
- Cycles: 43%
- Instructions: 85.7%
- IPC: 1.99x

The geometric mean of all 43 queries shows more than a 10%
performance improvement.

Refs: jemalloc/jemalloc#2842

Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>
Reviewed-by: Tianyou Li <tianyou.li@intel.com>
Reviewed-by: Wangyang Guo <wangyang.guo@intel.com>
Reviewed-by: Zhiguo Zhou <zhiguo.zhou@intel.com>