Add memory utilization analytics to mallctl#1463
Conversation
| static int | ||
| experimental_frag_ctl(tsd_t *tsd, const size_t *mib, size_t miblen, | ||
| void *oldp, size_t *oldlenp, void *newp, size_t newlen) { | ||
| assert(oldp != NULL && newp != NULL); |
There was a problem hiding this comment.
We would need to handle the cases (return EINV).
| void *oldp, size_t *oldlenp, void *newp, size_t newlen) { | ||
| assert(oldp != NULL && newp != NULL); | ||
|
|
||
| /* trim *oldlenp and newlen so that: |
There was a problem hiding this comment.
Check the coding style here: https://github.com/jemalloc/jemalloc/wiki/Coding-Style
| * (b) both are trimmed to a multiple of the unit size (if not yet) | ||
| * rely on compiler to upgrade multiplication / division to shifts | ||
| */ | ||
| if (*oldlenp / (2 * sizeof(uint16_t)) < newlen / sizeof(void *)) { |
There was a problem hiding this comment.
As mentioned in person, let's use 32-bit.
| * resides in, where the first one is the number of free regions and | ||
| * the second one is the total number of regions | ||
| */ | ||
| void * const newend = newp + newlen; |
There was a problem hiding this comment.
It looks this is causing issues in Windows CI: https://ci.appveyor.com/project/jasone/jemalloc/builds/23165922/job/q69f6p9pi9ijulq1
| for (; newp < newend; newp += sizeof(void *)) { | ||
| const void *ptr = *(const void **)newp; | ||
| const extent_t *ext = iealloc(tsd_tsdn(tsd), ptr); | ||
| if (extent_slab_get(ext)) { |
| */ | ||
| void * const newend = newp + newlen; | ||
| for (; newp < newend; newp += sizeof(void *)) { | ||
| const void *ptr = *(const void **)newp; |
There was a problem hiding this comment.
The core logic (the lines in the loop) should live in the extent.c.
|
I don't think it's the most important thing in the world while this is experimental, but I also have doubts that this is the right API. I think better would be to return some token that maps to an extent, and then have ways to access extent information given a token (I suspect that in the future we won't just care about fragmentation information, but also size, sampled status, etc. etc.). We should chat more generally tomorrow. |
|
Revised the change based on our offline discussions.
|
interwq
left a comment
There was a problem hiding this comment.
We can start adding unit tests for the new API.
| const extent_t *ext = iealloc( | ||
| tsd_tsdn(tsd), *(const void **)newp); | ||
| if (unlikely(ext == NULL)) { | ||
| memset(oldp, 0, sizeof(size_t) * 3); |
There was a problem hiding this comment.
Not really need to zero it out. As long as we return an error code, the user is not supposed to look into the values anyway.
| */ | ||
| void * const newend = (const void **)newp + newlen; | ||
| for (; newp < newend; newp = (const void **)newp + 1) { | ||
| const extent_t *ext = iealloc( |
There was a problem hiding this comment.
Let’s move the lookup into the extent info get function as well, doesn’t look like we need the extent_t here other than as the input to that function.
| extent_frag_get(ext, (size_t *)oldp, | ||
| (size_t *)oldp + 1, (size_t *)oldp + 2); | ||
| } | ||
| oldp = (size_t *)oldp + 3; |
There was a problem hiding this comment.
Let’s make a size_t *ptr and adjust that. Better keep the oldp intact.
| } | ||
|
|
||
| static inline void | ||
| extent_frag_get(const extent_t *extent, |
There was a problem hiding this comment.
Nit: not sure about the function name
|
Made a couple of modifications according to feedback and offline discussions.
|
| bool extent_boot(void); | ||
|
|
||
| int | ||
| extent_frag_stats(tsdn_t *tsdn, const void *ptr, |
There was a problem hiding this comment.
nit: put "int" on same line, use up to 80 chars per line.
re: naming -- maybe, extent_util_stats_get? We already use the term utilization in the stats world.
| ret = EINVAL; | ||
| goto label_return; | ||
| } | ||
| newlen /= sizeof(const void *); |
There was a problem hiding this comment.
nit: Similarly, keep the original input args, feel free to define a new one with explicit naming.
| label_return: | ||
| return ret; | ||
| } | ||
|
|
There was a problem hiding this comment.
Suggestion: add some example of the return format / layout, something like.
/*
* input[0]: pointer_to_query | output[0]: extent_n_free_items
* | output[1]: ....
...
Also a simple description regarding what the 4 args do.
| const extent_t *extent = iealloc(tsdn, ptr); | ||
| if (unlikely(extent == NULL)) { | ||
| *nfree = *nregs = *size = 0U; | ||
| return EFAULT; |
There was a problem hiding this comment.
nit: this can simply return true; the EFAULT etc is for user-facing return codes. Internal functions don't really need to use them.
| *nregs = 1U; | ||
| } | ||
| *size = extent_size_get(extent); | ||
| return 0; |
There was a problem hiding this comment.
nit: add assert(*size >= nfree * nregs);
| static const ctl_named_node_t experimental_node[] = { | ||
| {NAME("hooks"), CHILD(named, hooks)} | ||
| {NAME("hooks"), CHILD(named, hooks)}, | ||
| {NAME("frag"), CTL(experimental_frag)}, |
There was a problem hiding this comment.
Naming is hard --frag feels a bit strange. Let's always use fragmentation?
|
Revised according to feedback.
|
|
Added two assertions:
As discussed offline, the two |
| * decisions are not known to the caller. Therefore, we permit pointers to | ||
| * memory usages that may not be returned by previous malloc calls, and we | ||
| * provide the caller a convenient way to identify such cases. | ||
| */ |
There was a problem hiding this comment.
nit: let's also add a simple example for the actual defrag work, e.g. disable tcache; free / alloc; re-enable tcache.
| * provide the caller a convenient way to identify such cases. | ||
| */ | ||
| static int | ||
| experimental_fragmentation_ctl(tsd_t *tsd, const size_t *mib, size_t miblen, |
There was a problem hiding this comment.
how about experimental_fragmentation_query, or maybe utilization instead of fragmentation?
|
|
||
| const void **ptrs = (const void **)newp; | ||
| const void ** const ptrs_end = ptrs + len; | ||
| size_t *frag_stats = (size_t *)oldp; |
There was a problem hiding this comment.
nit: let's name this util_stats, to match the extent_util call below.
| out_sz_ref = out_sz /= 2; | ||
| in_sz /= 2; | ||
| assert_d_eq(mallctl( | ||
| "experimental.fragmentation", out, &out_sz, in, in_sz), |
There was a problem hiding this comment.
nit: spacing is off. Let's always do 4 spaces for second line, regardless of the nested calls.
|
Revised according to feedback. |
|
Also changed macro naming in tests. |
| * | ||
| * A typical workflow would be composed of the following steps: | ||
| * | ||
| * (1) flush and disable tcache using mallctl("thread.tcache.enabled", ...) |
There was a problem hiding this comment.
nit: explain why tcache needs to be disabled. Also, do the experimental.utilization call first, and we should only disable tcache if defrag is considered as necessary.
David mentioned that using explicitly controlled manual tcache might work too; however I just realized we don't provide a way to replace auto tcache (embedded in TSD) with manual tcache. So disabling / re-enabling tcache is probably the easiest way.
| * A typical workflow would be composed of the following steps: | ||
| * | ||
| * (1) flush and disable tcache using mallctl("thread.tcache.enabled", ...) | ||
| * (2) initialize temporary arrays for input/output |
There was a problem hiding this comment.
nit: maybe a bit more detailed, e.g. fill the input array with pointers to query fragmentation.
| goto label_return; | ||
| } | ||
|
|
||
| const void **ptrs = (const void **)newp; |
There was a problem hiding this comment.
nit: I slightly prefer having an integer index for the loop here, i.e.
typedef struct {
size_t nfree;
...
} extent_util_stats_t;
extent_util_stats_t *output_stats = (extent_util_stats_t *)oldp;
void **input_ptrs = (void **)newp;
for (unsigned i = 0; i < len; i++) {
extent_util_stats_get(tsd, &input_ptrs[i], &output_stats[i].nfree, ...);
}
|
Revised the change -
|
|
|
||
| static const ctl_named_node_t utilization_node[] = { | ||
| {NAME("query"), CTL(experimental_utilization_query)}, | ||
| {NAME("batch_query"), CTL(experimental_utilization_batch_query)}, |
| static const ctl_named_node_t experimental_node[] = { | ||
| {NAME("hooks"), CHILD(named, hooks)} | ||
| {NAME("hooks"), CHILD(named, hooks)}, | ||
| {NAME("utilization"), CHILD(named, utilization)}, |
| * | ||
| * It's up to the application how to determine the significance of | ||
| * fragmentation relying on the statistics returned. Possible choices are: | ||
| * (a) if input pointer deviates a lot from potential reallocation address, |
There was a problem hiding this comment.
nit: I'd put (a) to the last, as I imagine it won't be a popular choice.
| * It can be beneficial to define the following macros to make it easier to read | ||
| * the returned statistics: | ||
| * | ||
| * #define SLABCUR *(void **)oldp |
There was a problem hiding this comment.
nit: for better readability, define the offsets based on sizeof(size_t) would be better, e.g. NFREE_OFFSET would be sizeof(size_t), with NFREE_READ(output) as *((size_t *)((size_t)output + NFREE_OFFSET)))
| * | ||
| * A typical workflow would be composed of the following steps: | ||
| * | ||
| * (1) flush tcache: mallctl("thread.tcache.flush", ...) |
There was a problem hiding this comment.
nit: let's not duplicate the workflow example. In the single query case, say refer to the other place is fine. We should try to clean up the comments for the single query, to focus on the difference from the batch case.
| bin_t *bin = &arena->bins[szind].bin_shards[binshard]; | ||
|
|
||
| malloc_mutex_lock(tsdn, &bin->lock); | ||
| const extent_t *cur = bin->slabcur; |
There was a problem hiding this comment.
nit: no need for the cur var.
| void | ||
| extent_util_stats_get(tsdn_t *tsdn, const void *ptr, size_t *nfree, | ||
| size_t *nregs, size_t *size, void **slabcur_addr) { | ||
| *nfree = *nregs = *size = 0; |
There was a problem hiding this comment.
nit: We try avoiding double initialization, mainly for the purpose of explicitly setting the values, instead of relying on the default value to cover multiple cases (more error prone). Let's remove this line and the branch below, and see the comments below for changes.
|
|
||
| const extent_t *extent = iealloc(tsdn, ptr); | ||
| if (unlikely(extent == NULL)) { | ||
| return; |
There was a problem hiding this comment.
... add *nfree = *nregs = *size = 0;, then goto label_return;
| *size = extent_size_get(extent); | ||
| if (!extent_slab_get(extent)) { | ||
| *nregs = 1; | ||
| return; |
There was a problem hiding this comment.
... add *nfree = 0; and then goto label_return.
|
Revised the following:
FYI in the latest redis usage (see https://github.com/antirez/redis/blob/167705519b5f21b2aa3954527be15dc653131221/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h#L219 for the implementation and https://github.com/antirez/redis/blob/22e9321c3ee238f498980cf18781e10caaa01589/src/defrag.c#L57 for the caller), the fragmentation is triggered at the following condition: which is equivalent to: assuming |
|
Tests fail when |
|
Fixed the failure. It also makes me realize that the redis implementation assumed stats are turned on, and might crash otherwise since there can be a "divide by zero". |
| * In case of large class allocations, "(a)" will be NULL, and "(e)" and "(f)" | ||
| * will be zero. The other three fields will be properly set though the values | ||
| * are trivial: "(b)" will be 0, "(c)" will be 1, and "(d)" will be the original | ||
| * allocation size rounded up to the nearest size class. |
There was a problem hiding this comment.
nit: the term here is usable size.
| static int | ||
| experimental_utilization_query_ctl(tsd_t *tsd, const size_t *mib, | ||
| size_t miblen, void *oldp, size_t *oldlenp, void *newp, size_t newlen) { | ||
| int ret = 0; |
There was a problem hiding this comment.
nit: similarly, try avoiding double initialization. Here we only need to define ret. And on the success path, do ret = 0 before label_return.
| experimental_utilization_query_ctl(tsd_t *tsd, const size_t *mib, | ||
| size_t miblen, void *oldp, size_t *oldlenp, void *newp, size_t newlen) { | ||
| int ret = 0; | ||
| void *ptr = NULL; |
There was a problem hiding this comment.
nit: no need to initialize.
| assert(sizeof(extent_util_stats_verbose_t) | ||
| == sizeof(void *) + sizeof(size_t) * 5); | ||
|
|
||
| WRITE(ptr, void *); |
There was a problem hiding this comment.
nit: let's move this to after the input check below; and also check for newp in the branch.
| * This API is mainly intended for small class allocations, where extents are | ||
| * used as slab. In case of large class allocations, the outputs are trivial: | ||
| * "(a)" will be 0, "(b)" will be 1, and "(c)" will be the original allocation | ||
| * size rounded up to the nearest size class. |
| static int | ||
| experimental_utilization_batch_query_ctl(tsd_t *tsd, const size_t *mib, | ||
| size_t miblen, void *oldp, size_t *oldlenp, void *newp, size_t newlen) { | ||
| int ret = 0; |
There was a problem hiding this comment.
similarly, not double initialize ret.
|
Revised according to feedback. |
The analytics tool is put under experimental.utilization namespace in mallctl. Input is one pointer or an array of pointers and the output is a list of memory utilization statistics.
| test_hooks, | ||
| test_hooks_exhaustion); | ||
| test_hooks_exhaustion, | ||
| test_utilization_query, |
There was a problem hiding this comment.
The util query tests deserve their own tests I feel. Let's move them to a util_query.c in a follow up diff.
|
I plan to issue PR [link TBD] to Valkey, that replaces the existing patch valkey with this API. One peace of information that current algorithm uses and is not included in the API is the number of fullslabs in the bin. The motivation for having the number of fullslabs is to improve the convergence of defrag(the fullslabs are shifting the average and may cause stagnation in defrag process). Probably this API will do the work:
|
|
It's interesting that there are no links between this PR and #566. Until today I was not aware of this PR. |
### Summary of the change This is a base PR for refactoring defrag. It moves the defrag logic to rely on jemalloc [native api](jemalloc/jemalloc#1463 (comment)) instead of relying on custom code changes made by valkey in the jemalloc ([je_defrag_hint](https://github.com/valkey-io/valkey/blob/9f8185f5c80bc98bdbc631b90ccf13929d6a0cbc/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h#L382)) library. This enables valkey to use latest vanila jemalloc without the need to maintain code changes cross jemalloc versions. This change requires some modifications because the new api is providing only the information, not a yes\no defrag. The logic needs to be implemented at valkey code. Additionally, the api does not provide, within single call, all the information needed to make a decision, this information is available through additional api call. To reduce the calls to jemalloc, in this PR the required information is collected during the `computeDefragCycles` and not for every single ptr, this way we are avoiding the additional api call. Followup work will utilize the new options that are now open and will further improve the defrag decision and process. ### Added files: `allocator_defrag.c` / `allocator_defrag.h` - This files implement the allocator specific knowledge for making defrag decision. The knowledge about slabs and allocation logic and so on, all goes into this file. This improves the separation between jemalloc specific code and other possible implementation. ### Moved functions: [`zmalloc_no_tcache` , `zfree_no_tcache` ](https://github.com/valkey-io/valkey/blob/4593dc2f059661e1c4eb43bba025f68948344228/src/zmalloc.c#L215) - these are very jemalloc specific logic assumptions, and are very specific to how we defrag with jemalloc. This is also with the vision that from performance perspective we should consider using tcache, we only need to make sure we don't recycle entries without going through the arena [for example: we can use private tcache, one for free and one for alloc]. `frag_smallbins_bytes` - the logic and implementation moved to the new file ### Existing API: * [once a second + when completed full cycle] [`computeDefragCycles`](https://github.com/valkey-io/valkey/blob/4593dc2f059661e1c4eb43bba025f68948344228/src/defrag.c#L916) * `zmalloc_get_allocator_info` : gets from jemalloc _allocated, active, resident, retained, muzzy_, `frag_smallbins_bytes` * [`frag_smallbins_bytes`](https://github.com/valkey-io/valkey/blob/4593dc2f059661e1c4eb43bba025f68948344228/src/zmalloc.c#L690) : for each bin; gets from jemalloc bin_info, `curr_regs`, `cur_slabs` * [during defrag, for each pointer] * `je_defrag_hint` is getting a memory pointer and returns {0,1} . [Internally it uses](https://github.com/valkey-io/valkey/blob/4593dc2f059661e1c4eb43bba025f68948344228/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h#L368) this information points: * #`nonfull_slabs` * #`total_slabs` * #free regs in the ptr slab ## Jemalloc API (via ctl interface) [BATCH][`experimental_utilization_batch_query_ctl`](https://github.com/valkey-io/valkey/blob/4593dc2f059661e1c4eb43bba025f68948344228/deps/jemalloc/src/ctl.c#L4114) : gets an array of pointers, returns for each pointer 3 values, * number of free regions in the extent * number of regions in the extent * size of the extent in terms of bytes [EXTENDED][`experimental_utilization_query_ctl`](https://github.com/valkey-io/valkey/blob/4593dc2f059661e1c4eb43bba025f68948344228/deps/jemalloc/src/ctl.c#L3989) : * memory address of the extent a potential reallocation would go into * number of free regions in the extent * number of regions in the extent * size of the extent in terms of bytes * [stats-enabled]total number of free regions in the bin the extent belongs to * [stats-enabled]total number of regions in the bin the extent belongs to ### `experimental_utilization_batch_query_ctl` vs valkey `je_defrag_hint`? [good] - We can query pointers in a batch, reduce the overall overhead - The per ptr decision algorithm is not within jemalloc api, jemalloc only provides information, valkey can tune\configure\optimize easily [bad] - In the batch API we only know the utilization of the slab (of that memory ptr), we don’t get the data about #`nonfull_slabs` and total allocated regs. ## New functions: 1. `defrag_jemalloc_init`: Reducing the cost of call to je_ctl: use the [MIB interface](https://jemalloc.net/jemalloc.3.html) to get a faster calls. See this quote from the jemalloc documentation: The mallctlnametomib() function provides a way to avoid repeated name lookups for applications that repeatedly query the same portion of the namespace,by translating a name to a “Management Information Base” (MIB) that can be passed repeatedly to mallctlbymib(). 6. `jemalloc_sz2binind_lgq*` : this api is to support reverse map between bin size and it’s info without lookup. This mapping depends on the number of size classes we have that are derived from [`lg_quantum`](https://github.com/valkey-io/valkey/blob/4593dc2f059661e1c4eb43bba025f68948344228/deps/Makefile#L115) 7. `defrag_jemalloc_get_frag_smallbins` : This function replaces `frag_smallbins_bytes` the logic moved to the new file allocator_defrag `defrag_jemalloc_should_defrag_multi` → `handle_results` - unpacks the results 8. `should_defrag` : implements the same logic as the existing implementation [inside](https://github.com/valkey-io/valkey/blob/9f8185f5c80bc98bdbc631b90ccf13929d6a0cbc/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h#L382) je_defrag_hint 9. `defrag_jemalloc_should_defrag_multi` : implements the hint for an array of pointers, utilizing the new batch api. currently only 1 pointer is passed. ### Logical differences: In order to get the information about #`nonfull_slabs` and #`regs`, we use the query cycle to collect the information per size class. In order to find the index of bin information given bin size, in o(1), we use `jemalloc_sz2binind_lgq*` . ## Testing This is the first draft. I did some initial testing that basically fragmentation by reducing max memory and than waiting for defrag to reach desired level. The test only serves as sanity that defrag is succeeding eventually, no data provided here regarding efficiency and performance. ### Test: 1. disable `activedefrag` 2. run valkey benchmark on overlapping address ranges with different block sizes 3. wait untill `used_memory` reaches 10GB 4. set `maxmemory` to 5GB and `maxmemory-policy` to `allkeys-lru` 5. stop load 6. wait for `mem_fragmentation_ratio` to reach 2 7. enable `activedefrag` - start test timer 8. wait until reach `mem_fragmentation_ratio` = 1.1 #### Results*: (With this PR)Test results: ` 56 sec` (Without this PR)Test results: `67 sec` *both runs perform same "work" number of buffers moved to reach fragmentation target Next benchmarking is to compare to: - DONE // existing `je_get_defrag_hint` - compare with naive defrag all: `int defrag_hint() {return 1;}` --------- Signed-off-by: Zvi Schneider <ezvisch@amazon.com> Signed-off-by: Zvi Schneider <zvi.schneider22@gmail.com> Signed-off-by: zvi-code <54795925+zvi-code@users.noreply.github.com> Co-authored-by: Zvi Schneider <ezvisch@amazon.com> Co-authored-by: Zvi Schneider <zvi.schneider22@gmail.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
The analytics tool is put under experimental.utilization namespace in
mallctl. Input is one pointer or an array of pointers and the output
is a list of memory utilization statistics.