Improved active-defrag #10586
Conversation
… n slabs to retain, then merge back with a single call instead of a loop.
This change also compares slabs to a threshold slab address, instead of flagging every slab to retain, in order to decide which slabs to retain and which to defrag.
Only when both `bin->initing_defrag && bin->highest_slab_to_retain_inited` hold do we need to use the temp heap.
Force-pushed d1e3664 to 04b3897
Simple benchmark:
deps/jemalloc/src/jemalloc.c
Outdated
```c
tsd_t *tsd = tsd_fetch();
arena_t *arena = arena_choose(tsd, NULL);
```
AFAIK, an arena is thread-bound; since defrag only runs in the main thread, could this lead to overlooking internal fragmentation caused by other threads?
I think you mean external fragmentation, but yes, you are right: defrag will only handle main-thread allocations. But these are the Redis keys and values, which are what we're aiming at. This isn't a change from the existing active-defrag mechanism (@oranagra am I right here?).
@yoav-steinberg i didn't look at the code of this PR yet, but i suspect you're wrong.
In Redis both the IO threads and modules can currently use allocations from other arenas, and that can actually be one of the causes of high fragmentation.
The original active-defrag mechanism handled these well, since for each allocation it compared the current slab's frag ratio to the arena's average bin frag as an indication of the need to re-allocate.
It's true that the re-allocation itself would later move it to another arena, but i think that would eventually have reduced fragmentation.
If the new code only attempts to defrag allocations of the main arena, i think this is something we had better fix.
Some analysis:
I think our key/value allocations are from the main arena even when using I/O threads. This is because the query buffers aren't used for the key/values (even if they are used for argv).
In case of modules, however, we can't assume this. So you are right there's a potential bug here.
I was wrong re the old algorithm. I see it only looks at the slab's arena in order to make decisions, not at the main arena. But this means the old algorithm is less efficient when the allocation wasn't from the main arena. In that case we can simply return 1 from get_defrag_hint(), since any fragmentation in the non-main arena will never resolve itself if we realloc to the main arena. So moving everything out of the non-main arena is the only option in this case.
So bottom line:
- In the old algorithm we should always consider allocations in non-main arenas' non-full slabs as hits.
- In the new algorithm we should do the same.
- Ideally we should defrag each arena separately and then do the reallocations inside the same arena. This is possible by running the setup phase on each arena separately (instead of only on the main arena) and then doing the realloc on hits inside the same arena.
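The first two bullets could look roughly like this. This is a minimal sketch of the proposed rule, not jemalloc's actual `get_defrag_hint()`: the struct, field names, and `MAIN_ARENA_IND` below are illustrative assumptions.

```c
#include <stdbool.h>

/* Sketch: any allocation in a non-full slab of a non-main arena is always a
 * defrag "hit", because fragmentation there can never resolve itself once we
 * reallocate into the main arena. All names here are illustrative. */

typedef struct {
    unsigned arena_ind;      /* index of the arena owning the allocation's slab */
    unsigned slab_nregs;     /* regions currently used in the slab */
    unsigned slab_nregs_max; /* region capacity of the slab */
} alloc_info_t;

#define MAIN_ARENA_IND 0u

/* Returns 1 when the allocation should be moved (reallocated). */
static int get_defrag_hint_sketch(const alloc_info_t *info,
                                  unsigned avg_bin_nregs) {
    if (info->arena_ind != MAIN_ARENA_IND) {
        /* Non-full slab in a non-main arena: always a hit. */
        return info->slab_nregs < info->slab_nregs_max;
    }
    /* Main arena: fall back to the fullness heuristic. */
    return info->slab_nregs < avg_bin_nregs;
}
```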
> I think our key/value allocations are from the main arena even when using I/O threads. This is because the query buffers aren't used for the key/values (even if they are used for argv).
We may create `c->argv` in a non-main thread and execute the client command in the main thread; then we may just increment the `refcount` of a robj in `c->argv` to create a key or value, so this memory will have been allocated by a non-main thread.
Please point out if I'm wrong.
Yoav was referring to the fact that key names and values of most types are sds, so they're not the same robj allocation from the argv. And for strings we call tryObjectEncoding, which i might have told him copies the object, but i now realize i might have confused it with what trimStringObjectIfNeeded does (it copies the object only if it has a lot of excess space).
But anyway the final conclusion is the same: since modules may RM_RetainString these argv items and put them in the key-space, and modules can participate in the defrag mechanism (they have an API for that), either way we need to defrag all arenas.
deps/jemalloc/src/jemalloc.c
Outdated
```c
used_slabs += full_slabs;
assert(bin->highest_slab_to_retain_inited == false);
assert(extent_heap_empty(&bin->slabs_nonfull_temp));
if (used_slabs != needed_slabs) {
```
Is a threshold needed here? For example, only proceed when `((used_slabs - needed_slabs) / used_slabs) > threshold` (maybe `active_defrag_threshold_lower`).
If there aren't many more used slabs than needed slabs then the init phase should be quick. So I'm not sure there's much value in this additional threshold.
Because defrag does not defragment all memory, if some regions are occupied by memory that is not defragmented, it is possible that used_slabs will never equal needed_slabs.
You are right, but both the original algorithm and this one mitigate this by requiring the user to set relevant/sane `active-defrag-threshold-lower` and `active-defrag-ignore-bytes`. I'm not sure we need another threshold here; specifically, it would be a rather difficult one to explain.
@yoav-steinberg Use …

If we just compare the ptr's slab to the lowest in …
This PR tries to fix the heuristic nature of the current active defrag algorithm.
note: this was a joint effort with @inbaryuval.
The current algorithm
We check each allocation's slab's fullness and compare that to the average fullness of all slabs in the bin. If the slab is fuller than the average slab in the bin we keep the allocation; if not, we reallocate it.
The logic behind this approach is that [re]allocations always go to the lowest slabs, making it statistically probable that lower slabs will be fuller than higher ones.
The problem is this isn't always the case. The algorithm may reallocate a value that resides in a slab that isn't going to be freed, or may skip reallocating some values in the initial scan, requiring additional keyspace scans to complete the defragmentation process.
In practice we added safety thresholds when comparing slab fullness and added extra tests for edge cases that caused the algorithm never to complete without these safety factors. See (#7289, #9778).
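The current heuristic can be summarized by a small sketch. The types and names here are illustrative stand-ins for this description, not the actual jemalloc code:

```c
/* Sketch of the fullness heuristic described above: reallocate an allocation
 * only if its slab is less full than the average slab in its bin.
 * All names are illustrative. */

typedef struct {
    unsigned long bin_nregs_used; /* regions used across all slabs in the bin */
    unsigned long bin_nslabs;     /* number of slabs in the bin */
} bin_stats_t;

/* Returns 1 if the allocation should be moved. */
static int should_realloc_heuristic(unsigned slab_nregs_used,
                                    const bin_stats_t *bin) {
    /* average regions used per slab in this bin */
    unsigned long avg = bin->bin_nregs_used / bin->bin_nslabs;
    return (unsigned long)slab_nregs_used < avg;
}
```

The heuristic nature is visible here: the average says nothing definitive about whether this particular slab will ever be fully freed.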
The proposed algorithm
To avoid the problems above we can decide non-heuristically whether an allocation needs to be moved out of its slab. jemalloc always allocates from the lowest slab in the bin with free space in it (in an attempt to minimize fragmentation in the first place). This means that, given N slabs in the bin, at the end of the defragmentation process the lower M slabs (M<N) will be full while the higher N-M slabs will be freed. Given this knowledge, all we need to do per allocation is figure out whether that allocation's slab is in the lower M slabs or not. If it is, we don't need to reallocate it; if it isn't, we reallocate it and it will by definition move to somewhere in the lower M slabs.
Details
The algorithm needs to find, for each bin, the slab M which is the highest slab to be retained; all allocations in slabs with a higher address than M will need to be reallocated. The first phase of the algorithm calculates how many slabs are actually required for all the allocations in the bin: `defrag_slabs_to_retain`.
Setup phase
Once we know the value of `defrag_slabs_to_retain` we need to find the address of the highest slab in the group of the lowest `defrag_slabs_to_retain` slabs. We do this by popping the lowest `defrag_slabs_to_retain` slabs out of the bin's `slabs_nonfull` heap, and we store the address of the last slab to be popped in `highest_slab_to_retain`. All popped items are stored in a temporary heap, `slabs_nonfull_temp`, and once we know `highest_slab_to_retain` we merge this temporary heap back into the bin's normal `slabs_nonfull` heap. Note that the complexity of storing all the popped items into the temp heap is O(N*log(N)), where N is the number of popped slabs. This means that the setup phase may block for a long while if there are many non-full slabs.
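A minimal sketch of this setup phase, with a linear-scan array standing in for jemalloc's pairing heap. The fixed capacity, names, and helper functions are illustrative assumptions, not the real data structures:

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_SLABS 64 /* illustrative fixed capacity for the sketch */

/* A toy "heap" of slab addresses; pop always returns the lowest address,
 * mimicking a min-heap keyed on slab address. */
typedef struct {
    uintptr_t addr[MAX_SLABS];
    size_t n;
} slab_heap_t;

static uintptr_t heap_pop_lowest(slab_heap_t *h) {
    size_t min_i = 0;
    for (size_t i = 1; i < h->n; i++)
        if (h->addr[i] < h->addr[min_i]) min_i = i;
    uintptr_t lowest = h->addr[min_i];
    h->addr[min_i] = h->addr[--h->n]; /* swap-remove */
    return lowest;
}

static void heap_insert(slab_heap_t *h, uintptr_t a) { h->addr[h->n++] = a; }

/* Pop the lowest `nretain` slabs into a temp heap, remember the highest
 * popped address, then merge the temp heap back into the nonfull heap.
 * Returns highest_slab_to_retain. */
static uintptr_t setup_phase(slab_heap_t *nonfull, size_t nretain) {
    slab_heap_t temp = { .n = 0 };
    uintptr_t highest = 0;
    for (size_t i = 0; i < nretain && nonfull->n > 0; i++) {
        highest = heap_pop_lowest(nonfull);
        heap_insert(&temp, highest);
    }
    /* merge temp back into the bin's normal nonfull heap */
    while (temp.n > 0)
        heap_insert(nonfull, heap_pop_lowest(&temp));
    return highest;
}
```

The O(N*log(N)) cost in the real implementation comes from the heap inserts into the temp heap; this sketch only preserves the pop/record/merge structure.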
Incremental setup
To handle the possible blocking during the setup phase, we perform it incrementally. After calculating `defrag_slabs_to_retain` for each bin, we run the setup phase on each bin with a timeout; if the timeout expires we stop. At this point `slabs_nonfull` is split into `slabs_nonfull` and `slabs_nonfull_temp`, and any attempt to perform operations (get_first, remove, insert) on the "nonfull" heap of this bin will now access the relevant heap of the two. The defrag cron will resume the setup phase in its next cycle. Once the bin's setup phase is done we mark it as such and merge the two heaps back into a single `slabs_nonfull` heap.
Defrag phase
During this phase, instead of performing the heuristic described above, we just compare the slab of the allocation in question to `highest_slab_to_retain`: if its address is above that slab's address we reallocate; if not, we keep the allocation in place.
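The defrag-phase check then reduces to a single address comparison. In this sketch, the fixed `SLAB_SIZE` and the slab-size-aligned-slab assumption are illustrative (real jemalloc slab sizes vary per bin and the slab is looked up via extent metadata):

```c
#include <stdint.h>

#define SLAB_SIZE 4096u /* illustrative; real slab sizes vary per bin */

/* Derive the slab base address from an allocation pointer, assuming
 * slab-size-aligned slabs (an assumption made for this sketch only). */
static uintptr_t slab_of(uintptr_t ptr) {
    return ptr & ~((uintptr_t)SLAB_SIZE - 1);
}

/* An allocation is a hit (should be reallocated) iff its slab lies
 * strictly above highest_slab_to_retain. */
static int defrag_hit(uintptr_t ptr, uintptr_t highest_slab_to_retain) {
    return slab_of(ptr) > highest_slab_to_retain;
}
```

Unlike the old fullness heuristic, this check is deterministic: every hit moves, by construction, into the retained lower slabs, so one pass over the keyspace can complete the defragmentation.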