Remove valkey specific changes in jemalloc source code#1266
Conversation
|
This PR is replacing https://github.com/valkey-io/valkey/pull/692. It consists of 2 commits:
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## unstable #1266 +/- ##
============================================
+ Coverage 70.69% 70.73% +0.04%
============================================
Files 114 116 +2
Lines 63161 63239 +78
============================================
+ Hits 44650 44732 +82
+ Misses 18511 18507 -4
|
madolson
left a comment
There was a problem hiding this comment.
Mostly minor comments, and all the tests are still passing which is good. I might take another pass after this to just cleanup some comments for clarity.
3443501 to
639fa6b
Compare
|
@madolson, I fixed the comments |
madolson
left a comment
There was a problem hiding this comment.
It LGTM, only sticky point left is just some follows ups on the info fields. Would appreciate if one of @JimB123 or @zuiderkwast have time to take a look as well to review the jemalloc logic.
Signed-off-by: Zvi Schneider <ezvisch@amazon.com>
Signed-off-by: Zvi Schneider <zvi.schneider22@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Signed-off-by: zvi-code <54795925+zvi-code@users.noreply.github.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Signed-off-by: zvi-code <54795925+zvi-code@users.noreply.github.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Signed-off-by: zvi-code <54795925+zvi-code@users.noreply.github.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Signed-off-by: zvi-code <54795925+zvi-code@users.noreply.github.com>
Signed-off-by: Zvi Schneider <zvi.schneider22@gmail.com>
Signed-off-by: Zvi Schneider <zvi.schneider22@gmail.com>
Signed-off-by: Zvi Schneider <zvi.schneider22@gmail.com>
Signed-off-by: Zvi Schneider <zvi.schneider22@gmail.com>
Signed-off-by: Zvi Schneider <zvi.schneider22@gmail.com>
11d69e6 to
39a1f6d
Compare
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Signed-off-by: zvi-code <54795925+zvi-code@users.noreply.github.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Signed-off-by: zvi-code <54795925+zvi-code@users.noreply.github.com>
|
@valkey-io/core-team this PR replaces previous PR due to git history issues i could not fix. Can we put the 8.1 target (and if it's important than also mark with |
… required) Signed-off-by: Zvi Schneider <zvi.schneider22@gmail.com>
zuiderkwast
left a comment
There was a problem hiding this comment.
Yes! This looks very close merging now. Just some nits and some questions.
JimB123
left a comment
There was a problem hiding this comment.
I like that we are eliminating a custom version of jemalloc. If we can upgrade jemalloc as a drop-in replacement, that's a win.
|
|
||
| return makeDefragDecision(&arena_bin_conf.bin_info[binind], | ||
| &curr_usage[binind], | ||
| arena_bin_conf.bin_info[binind].nregs - nfree); |
There was a problem hiding this comment.
If what I said above is correct, this doesn't make sense to me. Are we computing the total number of regions in the BIN less the number of free regions in the SLAB? I don't understand what that would represent, or why. Is this a bug? or do we need more comments here.
There was a problem hiding this comment.
No, please see documentation:
/* Represents detailed information about a jemalloc bin.
*
* This struct provides metadata about a jemalloc bin, including the size of
* its regions, total number of regions, and related MIB keys for efficient
* queries.
*/
typedef struct jeBinInfo {
unsigned long reg_size; /* Size of each region in the bin. */
unsigned long nregs; /* Total number of regions in the bin. */
jeBinInfoKeys info_keys; /* Precomputed MIB keys for querying bin statistics. */
} jeBinInfo;
arena_bin_conf.bin_info holds metadata for each bin id (binind), as sucharena_bin_conf.bin_info[binind].nregs is the number of regions in each slab in this bin, it does not change after initial information is retrieved during init.
There was a problem hiding this comment.
The code above says that nregs is the "total number of regions in the bin". You just said that it's the number of regions per slab. Which is it?
There was a problem hiding this comment.
The hole struct holds static once initialized information about the bin. As such obviously it's not the number of regs in the bin at specific moment, but how many regs are there in a slab of this bin. I can change the comment to clarify if it's not clear
Signed-off-by: Zvi Schneider <zvi.schneider22@gmail.com>
|
@madolson , @JimB123 , @zuiderkwast , updated PR with fixes to the comments, please review |
madolson
left a comment
There was a problem hiding this comment.
Official, stamp, will wait to see if others want to take another review pass before merging.
| void allocatorDefragFree(void *ptr, size_t size); | ||
| __attribute__((malloc)) void *allocatorDefragAlloc(size_t size); |
There was a problem hiding this comment.
I was just going to add that we have a few of the zmalloc functions can now optionally take the size to improve the performance. See
Line 439 in aa2dd3e
Signed-off-by: Zvi Schneider <zvi.schneider22@gmail.com>
zuiderkwast
left a comment
There was a problem hiding this comment.
Looks good in general. I haven't reviewed all the details. I just have a few nits and some questions.
What exactly is the relationship between the units zmalloc and allocator_defrag? Both of them are allocator abstractions, but one doesn't depend on the other. They're on the same level, but they still have some dependencies to each other: Both have to be using the same allocator.
This structure seems fine to me. Perhaps we could add some comment in each of them to refer to the other file or something.
Application code
/ \
allocation / \ defrag
/ \
zmalloc allocator_defrag
/ | \ / \
/ | \ / \
/ | \ / \
libc tcmalloc jemalloc other
@zuiderkwast , I like your diagram, it is correct. We could have a non allocator defragmentation logic, for example think about lazy listpack\rax compaction, but i removed this abstraction to reduce the scope of the PR. How about this documentation along with the diagram? [I got some help] |
Signed-off-by: Zvi Schneider <zvi.schneider22@gmail.com>
| sz = sizeof(jemalloc_quantum); | ||
| je_mallctl("arenas.quantum", &jemalloc_quantum, &sz, NULL, 0); | ||
| /* lg-quantum should be 3 so jemalloc_quantum should be 1<<3 */ | ||
| assert(jemalloc_quantum == 8); |
There was a problem hiding this comment.
For AMD64 and x86_64 it seems to be defined as
# if (defined(__amd64__) || defined(__x86_64__) || defined(_M_X64))
# define LG_QUANTUM 4
# endif
For Arch Linux with system jemalloc in use this assert fails reportedly (#1585 (comment)).
There was a problem hiding this comment.
We still compile our included copy of jemalloc with some non-default options:
--with-lg-quantum=3 --disable-cache-oblivious --with-jemalloc-prefix=je_
This assert is unfortunate. We want it to work with any allocator. See also this issue: #1882
There was a problem hiding this comment.
@zuiderkwast , initial PR had the support, but based on the feedback, we decided to remove it - #1266 (comment). We can re-add it if we think this is the right thing to do
There was a problem hiding this comment.
For AMD64 and x86_64 it seems to be defined as
I guess this quote is the default for jemalloc, I believe this was exactly the reason we want to have the custom build, from source code with lg-quantum=3, to make sure the correct configurations of jemalloc for the valkey use case are used. Specifically in this case, lg-quantum=4 could greatly impact memory efficiency. Keep in mind, the 8 byte allocation size is used "strongly" all over valkey, including in memory optimization of sds (TYPE_5) list pack encoding and so on. Additionally, lg-quanum=3 allows also 24 bytes buffers, that are very common use in valkey.
There was a problem hiding this comment.
@zvi-code Ohhh! Yes, we should definitely re-add it, so we can support system jemalloc. Distros like Debian already patched Valkey to use system jemalloc, and I think it is very risky because they will hit this assert. We can fix it in this issue:
Keep in mind, the 8 byte allocation size is used "strongly" all over valkey
8 byte allocation is available even with lg-quantum = 4 (quantum 16), so there is no problem for this allocation size. It is presented in this table:
Additionally, lg-quanum=3 allows also 24 bytes buffers, that are very common use in valkey.
This was true mostly for dictEntry. That's why lg-quantum = 3 was added long ago in this commit: 6b836b6.
However, we no longer use dict in data like keys, hashes, sets, sorted sets. We have a new hashtable instead and it's not using small allocations. (The remaining dicts are small and/or short-lived and we should probably replace them with hashtables too.)
So without extensive use of dictEntry, we no longer have many 24 byte allocations, so I believe there is no reason to use lg-quantum = 3 anymore. Maybe we can change to lg-quantum 4 completely, but we should at least allow it for externally built jemalloc versions.
There was a problem hiding this comment.
The commmit "CR fixes 2" (fb2ca71) in this PR contains some hints to what we need to change to support defrag with lg-quantum 4. Another easier (temporary) patch is probably to just disable defrag when we detect lg-quantum 4.
*note: this PR replaced prior PR
Summary of the change
This is a base PR for refactoring defrag. It moves the defrag logic to rely on jemalloc native api instead of relying on custom code changes made by valkey in the jemalloc (je_defrag_hint) library. This enables valkey to use latest vanila jemalloc without the need to maintain code changes cross jemalloc versions.
This change requires some modifications because the new api is providing only the information, not a yes\no defrag. The logic needs to be implemented at valkey code. Additionally, the api does not provide, within single call, all the information needed to make a decision, this information is available through additional api call. To reduce the calls to jemalloc, in this PR the required information is collected during the
computeDefragCyclesand not for every single ptr, this way we are avoiding the additional api call.Followup work will utilize the new options that are now open and will further improve the defrag decision and process.
Added files:
allocator_defrag.c/allocator_defrag.h- This files implement the allocator specific knowledge for making defrag decision. The knowledge about slabs and allocation logic and so on, all goes into this file. This improves the separation between jemalloc specific code and other possible implementation.Moved functions:
zmalloc_no_tcache,zfree_no_tcache- these are very jemalloc specific logic assumptions, and are very specific to how we defrag with jemalloc. This is also with the vision that from performance perspective we should consider using tcache, we only need to make sure we don't recycle entries without going through the arena [for example: we can use private tcache, one for free and one for alloc].frag_smallbins_bytes- the logic and implementation moved to the new fileExisting API:
computeDefragCycleszmalloc_get_allocator_info: gets from jemalloc allocated, active, resident, retained, muzzy,frag_smallbins_bytesfrag_smallbins_bytes: for each bin; gets from jemalloc bin_info,curr_regs,cur_slabsje_defrag_hintis getting a memory pointer and returns {0,1} . Internally it uses this information points:nonfull_slabstotal_slabsJemalloc API (via ctl interface)
[BATCH]
experimental_utilization_batch_query_ctl: gets an array of pointers, returns for each pointer 3 values,[EXTENDED]
experimental_utilization_query_ctl:experimental_utilization_batch_query_ctlvs valkeyje_defrag_hint?[good]
[bad]
nonfull_slabsand total allocated regs.New functions:
defrag_jemalloc_init: Reducing the cost of call to je_ctl: use the MIB interface to get a faster calls. See this quote from the jemalloc documentation:The mallctlnametomib() function provides a way to avoid repeated name lookups for
applications that repeatedly query the same portion of the namespace,by translating
a name to a “Management Information Base” (MIB) that can be passed repeatedly to
mallctlbymib().
jemalloc_sz2binind_lgq*: this api is to support reverse map between bin size and it’s info without lookup. This mapping depends on the number of size classes we have that are derived fromlg_quantumdefrag_jemalloc_get_frag_smallbins: This function replacesfrag_smallbins_bytesthe logic moved to the new file allocator_defragdefrag_jemalloc_should_defrag_multi→handle_results- unpacks the resultsshould_defrag: implements the same logic as the existing implementation inside je_defrag_hintdefrag_jemalloc_should_defrag_multi: implements the hint for an array of pointers, utilizing the new batch api. currently only 1 pointer is passed.Logical differences:
In order to get the information about #
nonfull_slabsand #regs, we use the query cycle to collect the information per size class. In order to find the index of bin information given bin size, in o(1), we usejemalloc_sz2binind_lgq*.Testing
This is the first draft. I did some initial testing that basically fragmentation by reducing max memory and than waiting for defrag to reach desired level. The test only serves as sanity that defrag is succeeding eventually, no data provided here regarding efficiency and performance.
Test:
activedefragused_memoryreaches 10GBmaxmemoryto 5GB andmaxmemory-policytoallkeys-lrumem_fragmentation_ratioto reach 2activedefrag- start test timermem_fragmentation_ratio= 1.1Results*:
(With this PR)Test results:
56 sec(Without this PR)Test results:
67 sec*both runs perform same "work" number of buffers moved to reach fragmentation target
Next benchmarking is to compare to:
je_get_defrag_hintint defrag_hint() {return 1;}