Remove valkey specific changes in jemalloc source code by zvi-code · Pull Request #1266 · valkey-io/valkey

zvi-code · 2024-11-05T15:34:54Z

*Note: this PR replaces a prior PR.

Summary of the change

This is a base PR for refactoring defrag. It moves the defrag logic to rely on the jemalloc native API instead of on custom code changes that valkey made in the jemalloc library (je_defrag_hint). This enables valkey to use the latest vanilla jemalloc without maintaining code changes across jemalloc versions.

This change requires some modifications because the new API provides only information, not a yes/no defrag decision, so the decision logic has to be implemented in valkey code. Additionally, the API does not provide, within a single call, all the information needed to make a decision; the rest is available through an additional API call. To reduce the number of calls to jemalloc, this PR collects the required information during computeDefragCycles rather than for every single ptr, which avoids the additional API call.
Follow-up work will utilize the new options that are now open and will further improve the defrag decision and process.

Added files:

allocator_defrag.c / allocator_defrag.h - These files implement the allocator-specific knowledge for making the defrag decision. The knowledge about slabs, allocation logic and so on all goes into these files. This improves the separation between jemalloc-specific code and other possible implementations.

Moved functions:

zmalloc_no_tcache , zfree_no_tcache - these embed very jemalloc-specific assumptions, and are specific to how we defrag with jemalloc. This is also with the vision that, from a performance perspective, we should consider using tcache; we only need to make sure we don't recycle entries without going through the arena [for example: we can use private tcaches, one for free and one for alloc].
frag_smallbins_bytes - the logic and implementation moved to the new file

Existing API:

[once a second + when completed full cycle] computeDefragCycles
- zmalloc_get_allocator_info : gets from jemalloc allocated, active, resident, retained, muzzy, frag_smallbins_bytes
- frag_smallbins_bytes : for each bin; gets from jemalloc bin_info, curr_regs, cur_slabs
[during defrag, for each pointer]
- je_defrag_hint takes a memory pointer and returns {0,1} . Internally it uses these data points:
  - #nonfull_slabs
  - #total_slabs
  - #free regs in the ptr slab

Jemalloc API (via ctl interface)

[BATCH] experimental_utilization_batch_query_ctl : takes an array of pointers and returns 3 values for each pointer:

- number of free regions in the extent
- number of regions in the extent
- size of the extent in terms of bytes

[EXTENDED] experimental_utilization_query_ctl :

- memory address of the extent a potential reallocation would go into
- number of free regions in the extent
- number of regions in the extent
- size of the extent in terms of bytes
- [stats-enabled] total number of free regions in the bin the extent belongs to
- [stats-enabled] total number of regions in the bin the extent belongs to
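
To make the three per-pointer values of the batch query concrete, here is a small sketch that unpacks them and derives how many regions are in use in the pointer's slab. The flat layout (three size_t values per pointer, in the order listed above) and all names are assumptions made for illustration; the real call goes through the jemalloc ctl interface and is not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative view of one pointer's result: the three values listed above. */
typedef struct slabUtil {
    size_t nfree; /* free regions in the extent */
    size_t nregs; /* total regions in the extent */
    size_t size;  /* extent size in bytes */
} slabUtil;

/* Unpack `count` results from a flat buffer of 3 * count size_t values.
 * The flat layout is an assumption made for this sketch. */
static void unpackBatchResults(const size_t *raw, size_t count, slabUtil *out) {
    for (size_t i = 0; i < count; i++) {
        out[i].nfree = raw[3 * i + 0];
        out[i].nregs = raw[3 * i + 1];
        out[i].size = raw[3 * i + 2];
        assert(out[i].nfree <= out[i].nregs);
    }
}

/* Regions in use in the slab that holds the queried pointer. */
static size_t slabUsedRegs(const slabUtil *u) {
    return u->nregs - u->nfree;
}
```

Note that this gives only the utilization of the one slab; the bin-wide counters still have to come from somewhere else.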

`experimental_utilization_batch_query_ctl` vs valkey `je_defrag_hint`?

[good]

We can query pointers in a batch, reducing the overall overhead
The per-ptr decision algorithm is not inside the jemalloc API; jemalloc only provides information, so valkey can tune/configure/optimize it easily

[bad]

In the batch API we only know the utilization of the slab (of that memory ptr), we don’t get the data about #nonfull_slabs and total allocated regs.

New functions:

defrag_jemalloc_init: Reduces the cost of calls to je_ctl by using the MIB interface, which makes the calls faster. See this quote from the jemalloc documentation:

The mallctlnametomib() function provides a way to avoid repeated name lookups for
applications that repeatedly query the same portion of the namespace, by translating
a name to a “Management Information Base” (MIB) that can be passed repeatedly to
mallctlbymib().
jemalloc_sz2binind_lgq* : this API supports a reverse map between a bin size and its info without a lookup. This mapping depends on the number of size classes we have, which is derived from lg_quantum
defrag_jemalloc_get_frag_smallbins : This function replaces frag_smallbins_bytes; the logic moved to the new file allocator_defrag
defrag_jemalloc_should_defrag_multi → handle_results - unpacks the results
should_defrag : implements the same logic as the existing implementation inside je_defrag_hint
defrag_jemalloc_should_defrag_multi : implements the hint for an array of pointers, utilizing the new batch API. Currently only 1 pointer is passed.
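
For readers unfamiliar with the hint logic, the following is a hedged sketch of the kind of comparison a should_defrag function can make from the data points named earlier (#nonfull_slabs, regions in those slabs, free regions in the pointer's slab). The names, the 1/8 tolerance and the early-exit rules are assumptions for illustration and are not taken from this PR's code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical inputs; none of these names come from the PR. */
typedef struct binUsage {
    size_t nregs_per_slab;  /* static: regions in one slab of this bin */
    size_t nonfull_slabs;   /* collected once per defrag cycle */
    size_t regs_in_nonfull; /* allocated regions living in non-full slabs */
} binUsage;

/* Return 1 if the pointer's slab is no better utilized than the average
 * non-full slab (with a 1/8 tolerance), 0 otherwise. */
static int shouldDefragSketch(const binUsage *bin, size_t nfree_in_slab) {
    assert(nfree_in_slab <= bin->nregs_per_slab);
    if (nfree_in_slab == 0) return 0;     /* full slab: nothing to gain */
    if (bin->nonfull_slabs < 2) return 0; /* no other non-full slab to move into */
    size_t used = bin->nregs_per_slab - nfree_in_slab;
    size_t total = bin->regs_in_nonfull;
    /* Cross-multiply instead of dividing to avoid precision loss:
     * used <= 1.125 * (regs_in_nonfull / nonfull_slabs). */
    return used * bin->nonfull_slabs <= total + total / 8;
}
```

The tolerance exists so that a bin whose slabs are all equally utilized still makes progress instead of never qualifying.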

Logical differences:

In order to get the information about #nonfull_slabs and #regs, we use the query cycle to collect the information per size class. In order to find the index of the bin information given a bin size, in O(1), we use jemalloc_sz2binind_lgq* .
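
The O(1) size-to-bin lookup can be sketched with shifts and masks only. The version below is modeled on jemalloc-style size-class arithmetic under two stated assumptions: lg_quantum is 3 (so there are no tiny classes) and there are 4 size classes per doubling. The function and macro names are hypothetical, and this is not the PR's jemalloc_sz2binind_lgq* implementation.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_LG_QUANTUM 3 /* assumed: jemalloc built with --with-lg-quantum=3 */
#define SKETCH_LG_NGROUP 2  /* assumed: 4 size classes per doubling */

/* floor(log2(x)) for x > 0. */
static unsigned lgFloor(size_t x) {
    unsigned lg = 0;
    while ((x >>= 1) != 0) lg++;
    return lg;
}

/* Map a small allocation size (> 0) to its size-class index without a table. */
static unsigned sketchSizeToBinind(size_t size) {
    assert(size > 0);
    unsigned x = lgFloor((size << 1) - 1); /* ceil(log2(size)) */
    unsigned base = SKETCH_LG_NGROUP + SKETCH_LG_QUANTUM;
    /* Number of doublings past the first quantum-spaced group. */
    unsigned shift = (x < base) ? 0 : x - base;
    unsigned grp = shift << SKETCH_LG_NGROUP;
    /* log2 of the spacing between classes in this group. */
    unsigned lg_delta = (x < base + 1) ? SKETCH_LG_QUANTUM : x - SKETCH_LG_NGROUP - 1;
    /* Position of the class inside its group. */
    size_t mod = ((size - 1) >> lg_delta) & (((size_t)1 << SKETCH_LG_NGROUP) - 1);
    return grp + (unsigned)mod;
}
```

Under these assumptions the classes run 8, 16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128, ..., so a 24-byte request maps to index 2 and an 80-byte request to index 8. With a different lg_quantum the constants and the tiny-class offset change, which is why the mapping depends on lg_quantum.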

Testing

This is the first draft. I did some initial testing that basically creates fragmentation by reducing maxmemory and then waits for defrag to reach the desired level. The test only serves as a sanity check that defrag eventually succeeds; no data is provided here regarding efficiency and performance.

Test:

disable activedefrag
run valkey benchmark on overlapping address ranges with different block sizes
wait until used_memory reaches 10GB
set maxmemory to 5GB and maxmemory-policy to allkeys-lru
stop load
wait for mem_fragmentation_ratio to reach 2
enable activedefrag - start test timer
wait until mem_fragmentation_ratio reaches 1.1

Results*:

(With this PR)Test results: 56 sec
(Without this PR)Test results: 67 sec

*both runs perform the same "work" (number of buffers moved) to reach the fragmentation target

Next benchmarking is to compare to:

DONE // existing je_get_defrag_hint
compare with naive defrag all: int defrag_hint() {return 1;}

zvi-code · 2024-11-05T15:50:35Z

This PR is replacing https://github.com/valkey-io/valkey/pull/692.

It consists of 2 commits:

A cherry-pick of the original commit from PR 692 (on top of unstable)
A CR fixes commit to address the comments made on PR 692

codecov · 2024-11-06T12:14:41Z

Codecov Report

Attention: Patch coverage is 98.34711% with 2 lines in your changes missing coverage. Please review.

Project coverage is 70.73%. Comparing base (2df56d8) to head (2a6c2df).
Report is 22 commits behind head on unstable.

Files with missing lines	Patch %	Lines
src/server.c	50.00%	2 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #1266      +/-   ##
============================================
+ Coverage     70.69%   70.73%   +0.04%     
============================================
  Files           114      116       +2     
  Lines         63161    63239      +78     
============================================
+ Hits          44650    44732      +82     
+ Misses        18511    18507       -4

Files with missing lines	Coverage Δ
src/allocator_defrag.c	`100.00% <100.00%> (ø)`
src/defrag.c	`84.68% <100.00%> (-1.58%)`	⬇️
src/server.h	`100.00% <ø> (ø)`
src/zmalloc.c	`82.60% <100.00%> (-2.07%)`	⬇️
src/server.c	`87.64% <50.00%> (-0.06%)`	⬇️

... and 20 files with indirect coverage changes

madolson

Mostly minor comments, and all the tests are still passing, which is good. I might take another pass after this just to clean up some comments for clarity.

zvi-code · 2024-11-09T21:43:50Z

@madolson, I fixed the comments

madolson

It LGTM; the only sticking point left is some follow-ups on the info fields. Would appreciate it if one of @JimB123 or @zuiderkwast has time to take a look as well to review the jemalloc logic.

zvi-code · 2024-11-14T06:34:30Z

@valkey-io/core-team this PR replaces the previous PR due to git history issues I could not fix. Can we set the 8.1 target (and, if it's important, then also mark it as major decision approved)?

zuiderkwast · 2024-11-14T09:55:25Z

this PR replaces #692 due to git history issues i could not fix

@zvi-code OK. It's possible to completely replace the content of a PR by force-pushing completely new commits to the same branch name though. Next time. :)

zuiderkwast

Yes! This looks very close to merging now. Just some nits and some questions.

JimB123

I like that we are eliminating a custom version of jemalloc. If we can upgrade jemalloc as a drop-in replacement, that's a win.

JimB123 · 2024-11-15T18:51:44Z

+
+    return makeDefragDecision(&arena_bin_conf.bin_info[binind],
+                              &curr_usage[binind],
+                              arena_bin_conf.bin_info[binind].nregs - nfree);


If what I said above is correct, this doesn't make sense to me. Are we computing the total number of regions in the BIN less the number of free regions in the SLAB? I don't understand what that would represent, or why. Is this a bug? or do we need more comments here.

No, please see documentation:

/* Represents detailed information about a jemalloc bin.
 *
 * This struct provides metadata about a jemalloc bin, including the size of
 * its regions, total number of regions, and related MIB keys for efficient
 * queries. */
typedef struct jeBinInfo {
    unsigned long reg_size;  /* Size of each region in the bin. */
    unsigned long nregs;     /* Total number of regions in the bin. */
    jeBinInfoKeys info_keys; /* Precomputed MIB keys for querying bin statistics. */
} jeBinInfo;

arena_bin_conf.bin_info holds metadata for each bin id (binind); as such, arena_bin_conf.bin_info[binind].nregs is the number of regions in each slab of this bin. It does not change after the initial information is retrieved during init.

The code above says that nregs is the "total number of regions in the bin". You just said that it's the number of regions per slab. Which is it?

The whole struct holds static information about the bin that is fixed once initialized. As such, it is obviously not the number of regs in the bin at a specific moment, but how many regs there are in a slab of this bin. I can change the comment to clarify if it's not clear.

zvi-code · 2024-11-16T22:11:04Z

@madolson , @JimB123 , @zuiderkwast , updated PR with fixes to the comments, please review

madolson

Official stamp; will wait to see if others want to take another review pass before merging.

madolson · 2024-11-18T02:28:59Z

+void allocatorDefragFree(void *ptr, size_t size);
+__attribute__((malloc)) void *allocatorDefragAlloc(size_t size);


I was just going to add that a few of the zmalloc functions can now optionally take the size to improve performance. See

valkey/src/zmalloc.c

Line 439 in aa2dd3e

void zfree_with_size(void *ptr, size_t size) {

for an example.

zuiderkwast

Looks good in general. I haven't reviewed all the details. I just have a few nits and some questions.

What exactly is the relationship between the units zmalloc and allocator_defrag? Both of them are allocator abstractions, but one doesn't depend on the other. They're on the same level, but they still have some dependencies to each other: Both have to be using the same allocator.

This structure seems fine to me. Perhaps we could add some comment in each of them to refer to the other file or something.

        Application code
             /    \
 allocation /      \ defrag
           /        \
      zmalloc       allocator_defrag
      /  |   \       /     \
     /   |    \     /       \
    /    |     \   /         \
libc  tcmalloc  jemalloc     other

zvi-code · 2024-11-18T10:08:16Z

@zuiderkwast , I like your diagram, it is correct. We could have non-allocator defragmentation logic, for example think about lazy listpack/rax compaction, but I removed this abstraction to reduce the scope of the PR. How about this documentation along with the diagram? [I got some help]

/*
 * This file implements allocator-specific defragmentation logic used
 * within the Valkey engine. Below is the relationship between various
 * components involved in allocation and defragmentation:
 *
 *                  Application code
 *                     /       \
 *         allocation /         \ defrag
 *                   /           \
 *              zmalloc    allocator_defrag
 *               /  |   \       /     \
 *              /   |    \     /       \
 *             /    |     \   /         \
 *        libc  tcmalloc  jemalloc     other
 *
 * Explanation:
 * - **Application code**: High-level application logic that uses memory
 *   allocation and may trigger defragmentation.
 * - **zmalloc**: An abstraction layer over the memory allocator, providing
 *   a uniform allocation interface to the application code. It can delegate
 *   to various underlying allocators (e.g., libc, tcmalloc, jemalloc, or others).
 *   It is not dependent on the defrag implementation logic, and it's possible to use a jemalloc
 *   version that does not support defrag.
 * - **allocator_defrag**: This file contains allocator-specific logic for
 *   defragmentation, invoked from `defrag.c` when memory defragmentation is needed.
 *   Currently jemalloc is the only allocator with implemented defrag logic. It is possible that
 *   a future implementation will include non-allocator defragmentation (think of data-structure
 *   compaction for example).
 * - **Underlying allocators**: These are the actual memory allocators, such as
 *   libc, tcmalloc, jemalloc, or other custom allocators. The defragmentation
 *   logic in `allocator_defrag` interacts with these allocators to reorganize
 *   memory and reduce fragmentation.
 *
 * The `defrag.c` file acts as the central entry point for defragmentation,
 * invoking allocator-specific implementations provided here in `allocator_defrag.c`.
 *
 * Note: Developers working on `zmalloc` or `allocator_defrag` should refer to
 * the other component to ensure both are using the same allocator configuration.
 */

zuiderkwast

LGTM

freswa · 2025-04-08T11:14:58Z

+    sz = sizeof(jemalloc_quantum);
+    je_mallctl("arenas.quantum", &jemalloc_quantum, &sz, NULL, 0);
+    /* lg-quantum should be 3 so jemalloc_quantum should be 1<<3 */
+    assert(jemalloc_quantum == 8);


For AMD64 and x86_64 it seems to be defined as

#  if (defined(__amd64__) || defined(__x86_64__) || defined(_M_X64))
#    define LG_QUANTUM 4
#  endif

On Arch Linux with system jemalloc in use, this assert reportedly fails (#1585 (comment)).

We still compile our included copy of jemalloc with some non-default options:

--with-lg-quantum=3 --disable-cache-oblivious --with-jemalloc-prefix=je_

This assert is unfortunate. We want it to work with any allocator. See also this issue: #1882

@zuiderkwast , initial PR had the support, but based on the feedback, we decided to remove it - #1266 (comment). We can re-add it if we think this is the right thing to do

For AMD64 and x86_64 it seems to be defined as

I guess this quote is the default for jemalloc. I believe this was exactly the reason we wanted the custom build from source with lg-quantum=3: to make sure the correct jemalloc configuration for the valkey use case is used. Specifically in this case, lg-quantum=4 could greatly impact memory efficiency. Keep in mind, the 8 byte allocation size is used "strongly" all over valkey, including in memory optimization of sds (TYPE_5), listpack encoding and so on. Additionally, lg-quantum=3 also allows 24 byte buffers, which are very commonly used in valkey.

@zvi-code Ohhh! Yes, we should definitely re-add it, so we can support system jemalloc. Distros like Debian already patched Valkey to use system jemalloc, and I think it is very risky because they will hit this assert. We can fix it in this issue:

[NEW] Possibility to use system jemalloc #1882

Keep in mind, the 8 byte allocation size is used "strongly" all over valkey

8 byte allocation is available even with lg-quantum = 4 (quantum 16), so there is no problem for this allocation size. It is presented in this table:

http://jemalloc.net/jemalloc.3.html#size_classes

Additionally, lg-quantum=3 also allows 24 byte buffers, which are very commonly used in valkey.

This was true mostly for dictEntry. That's why lg-quantum = 3 was added long ago in this commit: 6b836b6.

However, we no longer use dict in data like keys, hashes, sets, sorted sets. We have a new hashtable instead and it's not using small allocations. (The remaining dicts are small and/or short-lived and we should probably replace them with hashtables too.)

So without extensive use of dictEntry, we no longer have many 24 byte allocations, so I believe there is no reason to use lg-quantum = 3 anymore. Maybe we can change to lg-quantum 4 completely, but we should at least allow it for externally built jemalloc versions.
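
The 24-byte point can be checked with a little arithmetic. The sketch below rounds a request up to the next multiple of the quantum, which is only a valid model of the size classes in a limited range: it assumes requests above 8 and up to 64 bytes, where the size-class table linked above is quantum-spaced for quantum 16, and where classes are assumed to be spaced 8 bytes apart for quantum 8. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Round a request up to the next multiple of `quantum`.
 * Only meant for 8 < size <= 64, where small size classes are assumed to be
 * quantum-spaced for both quantum 8 and quantum 16. */
static size_t roundToQuantum(size_t size, size_t quantum) {
    assert(size > 8 && size <= 64);
    assert(quantum == 8 || quantum == 16);
    return (size + quantum - 1) & ~(quantum - 1);
}
```

In this model a 24-byte request stays at 24 bytes with quantum 8 but is served from the 32-byte class with quantum 16, so the cost of lg-quantum 4 depends on how many 24-byte (and 40- or 56-byte) allocations remain.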

The commit "CR fixes 2" (fb2ca71) in this PR contains some hints about what we need to change to support defrag with lg-quantum 4. Another easier (temporary) patch is probably to just disable defrag when we detect lg-quantum 4.
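
A minimal sketch of the "disable instead of assert" idea follows. It assumes the quantum has already been read from the allocator (for example via the "arenas.quantum" query shown in the diff above) and is passed in as a plain value; the helper name is hypothetical and the policy (only quantum 8 is supported) is an assumption matching the current assert.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: decide at startup whether allocator-driven defrag
 * can be used, given the quantum reported by the allocator.
 * Returns 1 when the bin-index mapping assumptions hold, 0 otherwise. */
static int defragSupportedForQuantum(size_t quantum) {
    assert(quantum != 0); /* a reported quantum of 0 would be a caller bug */
    return quantum == 8;  /* only lg-quantum 3 is handled in this sketch */
}
```

A caller could then turn active defrag off and log a warning when this returns 0, rather than aborting the process on a system jemalloc built with lg-quantum 4.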

zvi-code mentioned this pull request Nov 5, 2024

defrag: replace je_get_defrag_hint with jemalloc native interface and remove valkey specific changes in jemalloc source code #692

Closed

madolson self-requested a review November 7, 2024 16:33

madolson reviewed Nov 7, 2024

zvi-code force-pushed the align_defrag_vanila_fix_history branch from 3443501 to 639fa6b Compare November 9, 2024 21:26

madolson reviewed Nov 13, 2024

Zvi Schneider and others added 12 commits November 13, 2024 15:06

defrag: use jemalloc api to align with vanila jemalloc

53f4f49

Signed-off-by: Zvi Schneider <ezvisch@amazon.com>

CR fixes

3e6e284

Signed-off-by: Zvi Schneider <ezvisch@amazon.com>

clang format fixes

1a4a432

Signed-off-by: Zvi Schneider <zvi.schneider22@gmail.com>

Update src/allocator_defrag.c

dee5b96

Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Signed-off-by: zvi-code <54795925+zvi-code@users.noreply.github.com>

Update src/allocator_defrag.c

644fd20

Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Signed-off-by: zvi-code <54795925+zvi-code@users.noreply.github.com>

Update src/allocator_defrag.c

87987ce

Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Signed-off-by: zvi-code <54795925+zvi-code@users.noreply.github.com>

Update src/allocator_defrag.c

3bb83be

Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Signed-off-by: zvi-code <54795925+zvi-code@users.noreply.github.com>

CR fixes 2

fb2ca71

Signed-off-by: Zvi Schneider <zvi.schneider22@gmail.com>

clang-format

d9fe41e

Signed-off-by: Zvi Schneider <zvi.schneider22@gmail.com>

remove todo

6159838

Signed-off-by: Zvi Schneider <zvi.schneider22@gmail.com>

cleanup + removal of unused code

fed1afb

Signed-off-by: Zvi Schneider <zvi.schneider22@gmail.com>

fix cmake build

39a1f6d

Signed-off-by: Zvi Schneider <zvi.schneider22@gmail.com>

zvi-code force-pushed the align_defrag_vanila_fix_history branch from 11d69e6 to 39a1f6d Compare November 13, 2024 13:39

zvi-code and others added 2 commits November 13, 2024 16:01

Update src/allocator_defrag.c

21896ce

Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Signed-off-by: zvi-code <54795925+zvi-code@users.noreply.github.com>

Update src/allocator_defrag.c

563fc86

Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Signed-off-by: zvi-code <54795925+zvi-code@users.noreply.github.com>

remove info and remove any unused code (will add if\when its actually…

41a9175

… required) Signed-off-by: Zvi Schneider <zvi.schneider22@gmail.com>

zuiderkwast reviewed Nov 14, 2024

JimB123 reviewed Nov 15, 2024

code review fixes

58f87fe

Signed-off-by: Zvi Schneider <zvi.schneider22@gmail.com>

madolson added the release-notes This issue should get a line item in the release notes label Nov 18, 2024

madolson changed the title ~~defrag: replace je_get_defrag_hint with jemalloc native interface and remove valkey specific changes in jemalloc source code~~ Remove valkey specific changes in jemalloc source code Nov 18, 2024

madolson approved these changes Nov 18, 2024

fix build warning error

5f91d7e

Signed-off-by: Zvi Schneider <zvi.schneider22@gmail.com>

zuiderkwast reviewed Nov 18, 2024

cr fixes: add\remove comments + fix include

2a6c2df

Signed-off-by: Zvi Schneider <zvi.schneider22@gmail.com>

zuiderkwast approved these changes Nov 18, 2024

JimB123 reviewed Nov 18, 2024

JimB123 approved these changes Nov 19, 2024

madolson merged commit b56eed2 into valkey-io:unstable Nov 22, 2024

madolson mentioned this pull request Nov 24, 2024

[Test Failure] Test failure in 32bit defragmentation #1344

Closed

hpatro mentioned this pull request Mar 25, 2025

Jemalloc defrag situation #364

Open

freswa reviewed Apr 8, 2025

zuiderkwast mentioned this pull request Apr 8, 2025

[BUG] Memory corruption caught by hardened allocators #1585

Closed

zuiderkwast mentioned this pull request Jun 30, 2025

[NEW] Consider support for mimalloc allocator #346

Open
