Skip to content

Add memory utilization analytics to mallctl#1463

Merged
interwq merged 1 commit into
jemalloc:devfrom
yinan1048576:defrag
Apr 4, 2019
Merged

Add memory utilization analytics to mallctl#1463
interwq merged 1 commit into
jemalloc:devfrom
yinan1048576:defrag

Conversation

@yinan1048576

@yinan1048576 yinan1048576 commented Mar 18, 2019

Copy link
Copy Markdown
Contributor

The analytics tool is put under experimental.utilization namespace in
mallctl. Input is one pointer or an array of pointers and the output
is a list of memory utilization statistics.

@interwq interwq changed the title add fragmentation analytics to mallctl [test only] [not for review yet] add fragmentation analytics to mallctl [test only] Mar 18, 2019
Comment thread src/ctl.c Outdated
static int
experimental_frag_ctl(tsd_t *tsd, const size_t *mib, size_t miblen,
void *oldp, size_t *oldlenp, void *newp, size_t newlen) {
assert(oldp != NULL && newp != NULL);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would need to handle the cases (return EINV).

Comment thread src/ctl.c Outdated
void *oldp, size_t *oldlenp, void *newp, size_t newlen) {
assert(oldp != NULL && newp != NULL);

/* trim *oldlenp and newlen so that:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread src/ctl.c Outdated
* (b) both are trimmed to a multiple of the unit size (if not yet)
* rely on compiler to upgrade multiplication / division to shifts
*/
if (*oldlenp / (2 * sizeof(uint16_t)) < newlen / sizeof(void *)) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in person, let's use 32-bit.

Comment thread src/ctl.c Outdated
* resides in, where the first one is the number of free regions and
* the second one is the total number of regions
*/
void * const newend = newp + newlen;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread src/ctl.c Outdated
for (; newp < newend; newp += sizeof(void *)) {
const void *ptr = *(const void **)newp;
const extent_t *ext = iealloc(tsd_tsdn(tsd), ptr);
if (extent_slab_get(ext)) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: assert(ext != NULL);

Comment thread src/ctl.c Outdated
*/
void * const newend = newp + newlen;
for (; newp < newend; newp += sizeof(void *)) {
const void *ptr = *(const void **)newp;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The core logic (the lines in the loop) should live in the extent.c.

@davidtgoldblatt

Copy link
Copy Markdown
Contributor

I don't think it's the most important thing in the world while this is experimental, but I also have doubts that this is the right API. I think better would be to return some token that maps to an extent, and then have ways to access extent information given a token (I suspect that in the future we won't just care about fragmentation information, but also size, sampled status, etc. etc.). We should chat more generally tomorrow.

@yinan1048576 yinan1048576 changed the title [not for review yet] add fragmentation analytics to mallctl [test only] Add fragmentation analytics to mallctl Mar 20, 2019
@yinan1048576

Copy link
Copy Markdown
Contributor Author

Revised the change based on our offline discussions.

  • Return three instead of two pieces of information for each input pointer: number of free regions, the total number of regions, and the total size of the corresponding extent. Application can have whatever criteria they want afterwards for determining need for defragmentation. Did not go with returning the handles of the corresponding extents because we don't want to always guarantee that the underlying extents will stay before the caller tries to query the other information using the extent handles.
  • The type of extent total size is size_t, so I bumped up the size of all three fields to size_t so that it's convenient for the application to fetch them.
  • I explicitly terminate and raise error whenever seeing improper input sizes, rather than silently manipulating them as I did previously.
  • I'm not exactly sure when the extent_t output from iealloc can be NULL. In theory the input pointers can be anything, and I assume it's possible that the iealloc output might output NULL. In such cases rather than doing assertion, I zero the output, continue the computation, but report an error to the caller in the end. The effect is that if the caller made a mistake when supplying the input pointers, which I assume was most likely due to carelessness, they know exactly if and where there are the mistake(s) afterwards, and all the other correct pointers got their output.
  • The issue reported from the Windows compiler was a legitimate one that gcc should have catched. I made the correction and hopefully the Windows compiler is happy this time...

@interwq interwq left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can start adding unit tests for the new API.

Comment thread src/ctl.c Outdated
const extent_t *ext = iealloc(
tsd_tsdn(tsd), *(const void **)newp);
if (unlikely(ext == NULL)) {
memset(oldp, 0, sizeof(size_t) * 3);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really need to zero it out. As long as we return an error code, the user is not supposed to look into the values anyway.

Comment thread src/ctl.c Outdated
*/
void * const newend = (const void **)newp + newlen;
for (; newp < newend; newp = (const void **)newp + 1) {
const extent_t *ext = iealloc(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let’s move the lookup into the extent info get function as well, doesn’t look like we need the extent_t here other than as the input to that function.

Comment thread src/ctl.c Outdated
extent_frag_get(ext, (size_t *)oldp,
(size_t *)oldp + 1, (size_t *)oldp + 2);
}
oldp = (size_t *)oldp + 3;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let’s make a size_t *ptr and adjust that. Better keep the oldp intact.

}

static inline void
extent_frag_get(const extent_t *extent,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: not sure about the function name

@yinan1048576

yinan1048576 commented Mar 21, 2019

Copy link
Copy Markdown
Contributor Author

Made a couple of modifications according to feedback and offline discussions.

  • Still decide to zero out the output fields for invalid input pointers. I've put the reasoning as comment. One more comment (for the comment): valid pointers (i.e. those returned from previous malloc calls) will surely give non-NULL extent, but invalid pointers may or may not always give NULL extent depending on the underlying implementation, and some "maliciously" invalid pointers may also directly lead to segfault. In my reasoning, for simplicity, I'm assuming: (a) the caller has good intention and provide us pointers they truly want to get fragmentation info about, and (b) NULL extent is a sufficient and necessary indication of invalid pointers.
  • Encapsulate the extent looking-up and extent reading into a single function and put it in extent.c, with declaration put in extent_extern.h. Again, not sure whether the function name is ideal - I'm calling it extent_frag_stats, which sounds slightly better than extent_frag to me.
  • Define additional local variables to make the code more intuitive and readable.
  • Added unit tests. I didn't add unit tests for testing NULL extent though, because it's still a mystery to me what the sufficient and necessary conditions for triggering it would be. In my local micro benchmark, I query for pointers to both the metadata and the content of std::string (where the content may or may not be an independent dynamic memory allocation), and I sometimes bump into EFAULT, but still no clue on the triggering condition.

bool extent_boot(void);

int
extent_frag_stats(tsdn_t *tsdn, const void *ptr,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: put "int" on same line, use up to 80 chars per line.

re: naming -- maybe, extent_util_stats_get? We already use the term utilization in the stats world.

Comment thread src/ctl.c Outdated
ret = EINVAL;
goto label_return;
}
newlen /= sizeof(const void *);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Similarly, keep the original input args, feel free to define a new one with explicit naming.

Comment thread src/ctl.c
label_return:
return ret;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: add some example of the return format / layout, something like.

/*
 * input[0]:  pointer_to_query  |  output[0]: extent_n_free_items
 *                              |  output[1]: ....
  ...

Also a simple description regarding what the 4 args do.

Comment thread src/extent.c Outdated
const extent_t *extent = iealloc(tsdn, ptr);
if (unlikely(extent == NULL)) {
*nfree = *nregs = *size = 0U;
return EFAULT;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this can simply return true; the EFAULT etc is for user-facing return codes. Internal functions don't really need to use them.

Comment thread src/extent.c Outdated
*nregs = 1U;
}
*size = extent_size_get(extent);
return 0;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add assert(*size >= nfree * nregs);

Comment thread src/ctl.c Outdated
static const ctl_named_node_t experimental_node[] = {
{NAME("hooks"), CHILD(named, hooks)}
{NAME("hooks"), CHILD(named, hooks)},
{NAME("frag"), CTL(experimental_frag)},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming is hard --frag feels a bit strange. Let's always use fragmentation?

@yinan1048576

Copy link
Copy Markdown
Contributor Author

Revised according to feedback.

  • Improve formatting / naming / readability.
  • Do not return EFAULT when extent is NULL. It doesn't affect the caller's ability to detect it at all because the output fields are zeroed out.
  • Improve comments.

@yinan1048576

Copy link
Copy Markdown
Contributor Author

Added two assertions:

  • assert(*nfree <= *nregs);
  • assert(*nfree * extent_usize_get(extent) <= *size);

As discussed offline, the two <= should always be <, but we allow == since it might not be a big issue even if it somehow appears.

Comment thread src/ctl.c
* decisions are not known to the caller. Therefore, we permit pointers to
* memory usages that may not be returned by previous malloc calls, and we
* provide the caller a convenient way to identify such cases.
*/

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: let's also add a simple example for the actual defrag work, e.g. disable tcache; free / alloc; re-enable tcache.

Comment thread src/ctl.c Outdated
* provide the caller a convenient way to identify such cases.
*/
static int
experimental_fragmentation_ctl(tsd_t *tsd, const size_t *mib, size_t miblen,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about experimental_fragmentation_query, or maybe utilization instead of fragmentation?

Comment thread src/ctl.c Outdated

const void **ptrs = (const void **)newp;
const void ** const ptrs_end = ptrs + len;
size_t *frag_stats = (size_t *)oldp;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: let's name this util_stats, to match the extent_util call below.

Comment thread test/unit/mallctl.c Outdated
out_sz_ref = out_sz /= 2;
in_sz /= 2;
assert_d_eq(mallctl(
"experimental.fragmentation", out, &out_sz, in, in_sz),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: spacing is off. Let's always do 4 spaces for second line, regardless of the nested calls.

@yinan1048576

Copy link
Copy Markdown
Contributor Author

Revised according to feedback.

@yinan1048576

Copy link
Copy Markdown
Contributor Author

Also changed macro naming in tests.

Comment thread src/ctl.c Outdated
*
* A typical workflow would be composed of the following steps:
*
* (1) flush and disable tcache using mallctl("thread.tcache.enabled", ...)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: explain why tcache needs to be disabled. Also, do the experimental.utilization call first, and we should only disable tcache if defrag is considered as necessary.

David mentioned that using explicitly controlled manual tcache might work too; however I just realized we don't provide a way to replace auto tcache (embedded in TSD) with manual tcache. So disabling / re-enabling tcache is probably the easiest way.

Comment thread src/ctl.c Outdated
* A typical workflow would be composed of the following steps:
*
* (1) flush and disable tcache using mallctl("thread.tcache.enabled", ...)
* (2) initialize temporary arrays for input/output

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe a bit more detailed, e.g. fill the input array with pointers to query fragmentation.

Comment thread src/ctl.c
Comment thread src/ctl.c Outdated
goto label_return;
}

const void **ptrs = (const void **)newp;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I slightly prefer having an integer index for the loop here, i.e.

typedef struct {
    size_t nfree;
    ...
} extent_util_stats_t;
extent_util_stats_t *output_stats = (extent_util_stats_t *)oldp;
void **input_ptrs = (void **)newp;
for (unsigned i = 0; i < len; i++) {
    extent_util_stats_get(tsd, &input_ptrs[i], &output_stats[i].nfree, ...);
}

@yinan1048576

yinan1048576 commented Apr 2, 2019

Copy link
Copy Markdown
Contributor Author

Revised the change -

  • Split the mallctl API into two: experimental.utilization.query for single pointer query, and experimental.utilization.batch_query for batch querying.
  • For single pointer query, return an additional field at the beginning: the address of slabcur, which is the place of a potential reallocation. It's up to the application how to use it to determine fragmentation. In the implementation from Redis, fragmentation is set to be true whenever this address of potential reallocation is not equal to the extent the input pointer resides in, which might be an overly strict criteria. If really desired, the application can achieve the same effect by examining whether the deviation of the input pointer from our returned slabcur_addr is below a very low threshold (i.e. the size of a single extent), but it's also possible to define less strict criteria by loosening the deviation threshold.
  • Do not return slabcur_addr for batch querying, because it'd be quickly changing throughout the defragmentation process and thus returning it for all pointers at query time may not make that much sense for determining defragmentation. The nfree field can also change throughout the defragmentation process, but it is relatively less volatile since it's always local to the extent the pointer resides in, and it has a nicer property: in the middle of the defragmentation process, at the time a particular pointer is defragmented, (a) if nfree decreased from what it was at querying time, then it means the defragmentation process up till this moment favored reallocating to this extent, implying that it hasn't been a problem earlier; (b) if nfree increased, then that means the defragmentation process up till this moment favored reallocating away from this extent, implying that it has already been a problem earlier. In both cases the defragmentation determination at the querying time wouldn't be affected even if we had relied on the nfree at this later time.
  • Explicitly define structs holding the output fields so as to improve readability. The structs are not exposed out and are only for internal use in src/ctl.c. I didn't make use of the structs in the unit tests because I'd like to mimic what application would do.
  • Improve the comment and code readability.

Comment thread src/ctl.c Outdated

static const ctl_named_node_t utilization_node[] = {
{NAME("query"), CTL(experimental_utilization_query)},
{NAME("batch_query"), CTL(experimental_utilization_batch_query)},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: no comma at the end.

Comment thread src/ctl.c Outdated
static const ctl_named_node_t experimental_node[] = {
{NAME("hooks"), CHILD(named, hooks)}
{NAME("hooks"), CHILD(named, hooks)},
{NAME("utilization"), CHILD(named, utilization)},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: comma too

Comment thread src/ctl.c Outdated
*
* It's up to the application how to determine the significance of
* fragmentation relying on the statistics returned. Possible choices are:
* (a) if input pointer deviates a lot from potential reallocation address,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd put (a) to the last, as I imagine it won't be a popular choice.

Comment thread src/ctl.c Outdated
* It can be beneficial to define the following macros to make it easier to read
* the returned statistics:
*
* #define SLABCUR *(void **)oldp

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: for better readability, define the offsets based on sizeof(size_t) would be better, e.g. NFREE_OFFSET would be sizeof(size_t), with NFREE_READ(output) as *((size_t *)((size_t)output + NFREE_OFFSET)))

Comment thread src/ctl.c
*
* A typical workflow would be composed of the following steps:
*
* (1) flush tcache: mallctl("thread.tcache.flush", ...)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: let's not duplicate the workflow example. In the single query case, say refer to the other place is fine. We should try to clean up the comments for the single query, to focus on the difference from the batch case.

Comment thread src/extent.c Outdated
bin_t *bin = &arena->bins[szind].bin_shards[binshard];

malloc_mutex_lock(tsdn, &bin->lock);
const extent_t *cur = bin->slabcur;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: no need for the cur var.

Comment thread src/extent.c Outdated
void
extent_util_stats_get(tsdn_t *tsdn, const void *ptr, size_t *nfree,
size_t *nregs, size_t *size, void **slabcur_addr) {
*nfree = *nregs = *size = 0;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: We try avoiding double initialization, mainly for the purpose of explicitly setting the values, instead of relying on the default value to cover multiple cases (more error prone). Let's remove this line and the branch below, and see the comments below for changes.

Comment thread src/extent.c

const extent_t *extent = iealloc(tsdn, ptr);
if (unlikely(extent == NULL)) {
return;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... add *nfree = *nregs = *size = 0;, then goto label_return;

Comment thread src/extent.c
*size = extent_size_get(extent);
if (!extent_slab_get(extent)) {
*nregs = 1;
return;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... add *nfree = 0; and then goto label_return.

Comment thread src/extent.c
@yinan1048576 yinan1048576 changed the title Add fragmentation analytics to mallctl Add memory utilization analytics to mallctl Apr 4, 2019
@yinan1048576

Copy link
Copy Markdown
Contributor Author

Revised the following:

  • Added two additional output fields bin_nfree and bin_nregs for single pointer query. This is to ensure that this new API is fully compatible with redis usage case.
  • Improved code quality & readability, comment and unit tests.

FYI in the latest redis usage (see https://github.com/antirez/redis/blob/167705519b5f21b2aa3954527be15dc653131221/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h#L219 for the implementation and https://github.com/antirez/redis/blob/22e9321c3ee238f498980cf18781e10caaa01589/src/defrag.c#L57 for the caller), the fragmentation is triggered at the following condition:

je_get_defrag_hint(ptr, &bin_util, &run_util)
    && run_util <= bin_util
    && run_util < 1<<16

which is equivalent to:

ptr >= SLABCUR_READ(out) && ptr < SLABCUR_READ(out) + SIZE_READ(out)
    && NFREE_READ(out) * BIN_NREGS_READ(out) >= NREGS_READ(out) * BIN_NFREE_READ(out)
    && NFREE_READ(out) > 0

assuming out points to the mallctl output.

@yinan1048576

Copy link
Copy Markdown
Contributor Author

Tests fail when --disable-stats is set, which is expected. Fixing it...

@yinan1048576

Copy link
Copy Markdown
Contributor Author

Fixed the failure. It also makes me realize that the redis implementation assumed stats are turned on, and might crash otherwise since there can be a "divide by zero".

Comment thread src/ctl.c Outdated
* In case of large class allocations, "(a)" will be NULL, and "(e)" and "(f)"
* will be zero. The other three fields will be properly set though the values
* are trivial: "(b)" will be 0, "(c)" will be 1, and "(d)" will be the original
* allocation size rounded up to the nearest size class.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the term here is usable size.

Comment thread src/ctl.c Outdated
static int
experimental_utilization_query_ctl(tsd_t *tsd, const size_t *mib,
size_t miblen, void *oldp, size_t *oldlenp, void *newp, size_t newlen) {
int ret = 0;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: similarly, try avoiding double initialization. Here we only need to define ret. And on the success path, do ret = 0 before label_return.

Comment thread src/ctl.c Outdated
experimental_utilization_query_ctl(tsd_t *tsd, const size_t *mib,
size_t miblen, void *oldp, size_t *oldlenp, void *newp, size_t newlen) {
int ret = 0;
void *ptr = NULL;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: no need to initialize.

Comment thread src/ctl.c Outdated
assert(sizeof(extent_util_stats_verbose_t)
== sizeof(void *) + sizeof(size_t) * 5);

WRITE(ptr, void *);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: let's move this to after the input check below; and also check for newp in the branch.

Comment thread src/ctl.c Outdated
* This API is mainly intended for small class allocations, where extents are
* used as slab. In case of large class allocations, the outputs are trivial:
* "(a)" will be 0, "(b)" will be 1, and "(c)" will be the original allocation
* size rounded up to the nearest size class.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: term usable size too.

Comment thread src/ctl.c Outdated
static int
experimental_utilization_batch_query_ctl(tsd_t *tsd, const size_t *mib,
size_t miblen, void *oldp, size_t *oldlenp, void *newp, size_t newlen) {
int ret = 0;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similarly, not double initialize ret.

@yinan1048576

Copy link
Copy Markdown
Contributor Author

Revised according to feedback.

The analytics tool is put under experimental.utilization namespace in
mallctl.  Input is one pointer or an array of pointers and the output
is a list of memory utilization statistics.
Comment thread test/unit/mallctl.c
test_hooks,
test_hooks_exhaustion);
test_hooks_exhaustion,
test_utilization_query,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The util query tests deserve their own tests I feel. Let's move them to a util_query.c in a follow up diff.

@zvi-code

zvi-code commented Jun 10, 2024

Copy link
Copy Markdown

I plan to issue PR [link TBD] to Valkey, that replaces the existing patch valkey with this API. One peace of information that current algorithm uses and is not included in the API is the number of fullslabs in the bin. The motivation for having the number of fullslabs is to improve the convergence of defrag(the fullslabs are shifting the average and may cause stagnation in defrag process).
What is the best alternative way I can fetch this information with one of the existing API?

Probably this API will do the work:

stats.arenas..bins..curslabs (size_t) r- [--enable-stats]
stats.arenas..bins..nonfull_slabs (size_t) r- [--enable-stats]

@zuiderkwast

Copy link
Copy Markdown

It's interesting that there are no links between this PR and #566. Until today I was not aware of this PR.

madolson added a commit to valkey-io/valkey that referenced this pull request Nov 22, 2024
### Summary of the change

This is a base PR for refactoring defrag. It moves the defrag logic to
rely on jemalloc [native
api](jemalloc/jemalloc#1463 (comment))
instead of relying on custom code changes made by valkey in the jemalloc
([je_defrag_hint](https://github.com/valkey-io/valkey/blob/9f8185f5c80bc98bdbc631b90ccf13929d6a0cbc/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h#L382))
library. This enables valkey to use latest vanila jemalloc without the
need to maintain code changes cross jemalloc versions.

This change requires some modifications because the new api is providing
only the information, not a yes\no defrag. The logic needs to be
implemented at valkey code. Additionally, the api does not provide,
within single call, all the information needed to make a decision, this
information is available through additional api call. To reduce the
calls to jemalloc, in this PR the required information is collected
during the `computeDefragCycles` and not for every single ptr, this way
we are avoiding the additional api call.
Followup work will utilize the new options that are now open and will
further improve the defrag decision and process.

### Added files: 

`allocator_defrag.c` / `allocator_defrag.h` - This files implement the
allocator specific knowledge for making defrag decision. The knowledge
about slabs and allocation logic and so on, all goes into this file.
This improves the separation between jemalloc specific code and other
possible implementation.


### Moved functions: 

[`zmalloc_no_tcache` , `zfree_no_tcache`
](https://github.com/valkey-io/valkey/blob/4593dc2f059661e1c4eb43bba025f68948344228/src/zmalloc.c#L215)
- these are very jemalloc specific logic assumptions, and are very
specific to how we defrag with jemalloc. This is also with the vision
that from performance perspective we should consider using tcache, we
only need to make sure we don't recycle entries without going through
the arena [for example: we can use private tcache, one for free and one
for alloc].
`frag_smallbins_bytes` - the logic and implementation moved to the new
file

### Existing API:

* [once a second + when completed full cycle]
[`computeDefragCycles`](https://github.com/valkey-io/valkey/blob/4593dc2f059661e1c4eb43bba025f68948344228/src/defrag.c#L916)
* `zmalloc_get_allocator_info` : gets from jemalloc _allocated, active,
resident, retained, muzzy_, `frag_smallbins_bytes`
*
[`frag_smallbins_bytes`](https://github.com/valkey-io/valkey/blob/4593dc2f059661e1c4eb43bba025f68948344228/src/zmalloc.c#L690)
: for each bin; gets from jemalloc bin_info, `curr_regs`, `cur_slabs`
* [during defrag, for each pointer]
* `je_defrag_hint` is getting a memory pointer and returns {0,1} .
[Internally it
uses](https://github.com/valkey-io/valkey/blob/4593dc2f059661e1c4eb43bba025f68948344228/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h#L368)
this information points:
        * #`nonfull_slabs`
        * #`total_slabs`
        * #free regs in the ptr slab

## Jemalloc API (via ctl interface)


[BATCH][`experimental_utilization_batch_query_ctl`](https://github.com/valkey-io/valkey/blob/4593dc2f059661e1c4eb43bba025f68948344228/deps/jemalloc/src/ctl.c#L4114)
: gets an array of pointers, returns for each pointer 3 values,

* number of free regions in the extent
* number of regions in the extent
* size of the extent in terms of bytes


[EXTENDED][`experimental_utilization_query_ctl`](https://github.com/valkey-io/valkey/blob/4593dc2f059661e1c4eb43bba025f68948344228/deps/jemalloc/src/ctl.c#L3989)
:

* memory address of the extent a potential reallocation would go into
* number of free regions in the extent
* number of regions in the extent
* size of the extent in terms of bytes
* [stats-enabled]total number of free regions in the bin the extent
belongs to
* [stats-enabled]total number of regions in the bin the extent belongs
to

### `experimental_utilization_batch_query_ctl` vs valkey
`je_defrag_hint`?
[good]
   - We can query pointers in a batch, reduce the overall overhead
- The per ptr decision algorithm is not within jemalloc api, jemalloc
only provides information, valkey can tune\configure\optimize easily

 
[bad]
- In the batch API we only know the utilization of the slab (of that
memory ptr), we don’t get the data about #`nonfull_slabs` and total
allocated regs.


## New functions:
1. `defrag_jemalloc_init`: Reducing the cost of call to je_ctl: use the
[MIB interface](https://jemalloc.net/jemalloc.3.html) to get a faster
calls. See this quote from the jemalloc documentation:
    
The mallctlnametomib() function provides a way to avoid repeated name
lookups for
applications that repeatedly query the same portion of the namespace,by
translating
a name to a “Management Information Base” (MIB) that can be passed
repeatedly to
    mallctlbymib().

6. `jemalloc_sz2binind_lgq*` : this api is to support reverse map
between bin size and it’s info without lookup. This mapping depends on
the number of size classes we have that are derived from
[`lg_quantum`](https://github.com/valkey-io/valkey/blob/4593dc2f059661e1c4eb43bba025f68948344228/deps/Makefile#L115)
7. `defrag_jemalloc_get_frag_smallbins` : This function replaces
`frag_smallbins_bytes` the logic moved to the new file allocator_defrag
`defrag_jemalloc_should_defrag_multi` → `handle_results` - unpacks the
results
8. `should_defrag` : implements the same logic as the existing
implementation
[inside](https://github.com/valkey-io/valkey/blob/9f8185f5c80bc98bdbc631b90ccf13929d6a0cbc/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h#L382)
je_defrag_hint
9. `defrag_jemalloc_should_defrag_multi` : implements the hint for an
array of pointers, utilizing the new batch api. currently only 1 pointer
is passed.


### Logical differences:

In order to get the information about #`nonfull_slabs` and #`regs`, we
use the query cycle to collect the information per size class. In order
to find the index of bin information given bin size, in o(1), we use
`jemalloc_sz2binind_lgq*` .


## Testing
This is the first draft. I did some initial testing that basically
fragmentation by reducing max memory and than waiting for defrag to
reach desired level. The test only serves as sanity that defrag is
succeeding eventually, no data provided here regarding efficiency and
performance.

### Test: 
1. disable `activedefrag`
2. run valkey benchmark on overlapping address ranges with different
block sizes
3. wait untill `used_memory` reaches 10GB
4. set `maxmemory` to 5GB and `maxmemory-policy` to `allkeys-lru`
5. stop load
6. wait for `mem_fragmentation_ratio` to reach 2
7. enable `activedefrag` - start test timer
8. wait until reach `mem_fragmentation_ratio` = 1.1

#### Results*:
(With this PR)Test results: ` 56 sec`
(Without this PR)Test results: `67 sec`

*both runs perform same "work" number of buffers moved to reach
fragmentation target

Next benchmarking is to compare to:
- DONE // existing `je_get_defrag_hint` 
- compare with naive defrag all: `int defrag_hint() {return 1;}`

---------

Signed-off-by: Zvi Schneider <ezvisch@amazon.com>
Signed-off-by: Zvi Schneider <zvi.schneider22@gmail.com>
Signed-off-by: zvi-code <54795925+zvi-code@users.noreply.github.com>
Co-authored-by: Zvi Schneider <ezvisch@amazon.com>
Co-authored-by: Zvi Schneider <zvi.schneider22@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants