
Conversation

@yoav-steinberg (Contributor) commented Apr 14, 2022

This PR tries to fix the heuristic nature of the current active defrag algorithm.

note: this was a joint effort with @inbaryuval.

The current algorithm

We check each allocation's slab's fullness and compare it to the average fullness of all slabs in the bin. If the slab is fuller than the average slab in the bin we keep the allocation in place; if not, we reallocate it.
The logic behind this approach is that [re]allocations always go to the lowest slabs, making it statistically probable that lower slabs will be fuller than higher ones.
The problem is that this isn't always the case. The algorithm may reallocate a value that resides in a slab that isn't going to be freed, or it may skip reallocating some values in the initial scan, requiring additional keyspace scans to complete the defragmentation process.
In practice we added safety thresholds when comparing slab fullness, plus extra tests for edge cases in which the algorithm never completed without these safety factors. See #7289, #9778.
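For illustration only, a minimal sketch of that heuristic (the inputs and the threshold value are assumptions, not the actual jemalloc/Redis code):

```c
/* Old heuristic, sketched: move an allocation only if its slab is
 * meaningfully emptier than the average slab in its bin. */
static int old_defrag_hint(size_t slab_regs_used, size_t slab_regs_total,
                           float bin_avg_utilization) {
    float slab_utilization = (float)slab_regs_used / (float)slab_regs_total;
    const float safety_threshold = 0.05f; /* avoids endless churn near the average */
    return slab_utilization + safety_threshold < bin_avg_utilization; /* 1 = move */
}
```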

The proposed algorithm

To avoid the problems above we can decide whether an allocation needs to be moved from a slab in a non-heuristic way. jemalloc always allocates on the lowest slab in the bin with free space in it (in an attempt to minimize fragmentation in the first place). This means that given N slabs in the bin, at the end of the defragmentation process the lower M slabs (M<N) will be full while the higher N-M slabs will be freed. Given this knowledge, all we need to do per allocation is figure out whether that allocation's slab is in the lower M slabs or not. If it is, we don't need to reallocate it; if it isn't, we reallocate it, and by definition it will move somewhere into the lower M slabs.

Details

The algorithm needs to find, for each bin, the slab M which is the highest slab to be retained; all allocations in slabs with a higher address than M will need to be reallocated. The first phase of the algorithm calculates how many slabs are actually required for all the allocations in the bin: defrag_slabs_to_retain.
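Conceptually this count is a ceiling division of the bin's live regions by its fixed regions-per-slab (a minimal sketch with assumed inputs, not the PR's code):

```c
#include <stddef.h>

/* Minimum number of slabs needed to hold all live regions of a bin. */
static size_t slabs_to_retain(size_t live_regions, size_t regions_per_slab) {
    return (live_regions + regions_per_slab - 1) / regions_per_slab; /* ceil */
}
```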

Setup phase

Once we know the value of defrag_slabs_to_retain we need to find the address of the highest slab in the group of the lowest defrag_slabs_to_retain slabs. We do this by popping the lowest defrag_slabs_to_retain slabs out of the bin's slabs_nonfull heap. We store the address of the last slab to be popped in highest_slab_to_retain.
All popped items are stored in a temporary heap, slabs_nonfull_temp, and once we know highest_slab_to_retain we merge this temporary heap back into the bin's normal slabs_nonfull heap.
Note that the complexity of storing all popped items in the temp heap is O(N*log(N)), where N is the number of popped slabs. This means that the setup phase may block for a long while if there are many non-full slabs.
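As an illustration, a self-contained sketch of this step with the heap modeled as an address-sorted array (not the PR's code; with the real heaps each pop and re-insert is O(log N), hence the cost above):

```c
#include <stddef.h>

/* slabs_nonfull modeled as an array of slab addresses sorted ascending.
 * Returns the address of the highest slab among the lowest `retain`
 * slabs; this becomes highest_slab_to_retain. In the PR the popped
 * slabs are parked in slabs_nonfull_temp and merged back afterwards. */
static void *find_highest_slab_to_retain(void **slabs_nonfull, size_t count,
                                         size_t retain) {
    void *highest = NULL;
    for (size_t i = 0; i < retain && i < count; i++) {
        highest = slabs_nonfull[i]; /* i-th lowest-addressed slab */
    }
    return highest;
}
```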

Incremental setup

To handle the possible blocking during the setup phase we perform it incrementally. After calculating defrag_slabs_to_retain for each bin we run the setup phase on each bin with a timeout. If the timeout expires we stop. At this point slabs_nonfull is split between slabs_nonfull and slabs_nonfull_temp, and any attempt to perform operations (get_first, remove, insert) on the "nonfull" heap of this bin will access the relevant heap of the two. The defrag cron will resume the setup phase in its next cycle. Once the bin's setup phase is done we mark it as such and merge the two heaps back into a single slabs_nonfull heap.
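A sketch of how get_first might behave on the split heap during incremental setup (illustrative names; both heaps again modeled as address-sorted arrays):

```c
#include <stddef.h>
#include <stdint.h>

/* While setup is in progress the bin's non-full slabs are split across
 * slabs_nonfull and slabs_nonfull_temp; "get first" must consult both
 * and return the lower-addressed slab, just as a single heap would. */
static void *nonfull_get_first(void **nonfull, size_t n,
                               void **nonfull_temp, size_t n_temp) {
    void *a = (n > 0) ? nonfull[0] : NULL;
    void *b = (n_temp > 0) ? nonfull_temp[0] : NULL;
    if (a == NULL) return b;
    if (b == NULL) return a;
    return ((uintptr_t)a < (uintptr_t)b) ? a : b;
}
```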

Defrag phase

During this phase, instead of performing the heuristic described above, we just compare the slab of the allocation in question to highest_slab_to_retain: if it's above that slab's address we reallocate; if not, we keep the allocation in place.
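A sketch of that check (illustrative signature; the real code obtains the slab and boundary from jemalloc internals):

```c
#include <stdint.h>

/* Defrag-phase decision, sketched: move the allocation only if its slab
 * lies above the boundary computed in the setup phase. */
static int defrag_hint(void *slab_addr, void *highest_slab_to_retain) {
    /* 1 = reallocate (the slab is due to be freed), 0 = keep in place. */
    return (uintptr_t)slab_addr > (uintptr_t)highest_slab_to_retain;
}
```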

  • Add some benchmarks.

@yoav-steinberg yoav-steinberg marked this pull request as ready for review April 21, 2022 16:58
@yoav-steinberg yoav-steinberg requested a review from oranagra April 21, 2022 16:58
@yoav-steinberg (Contributor Author) commented May 2, 2022

Simple benchmark:

  1. Run redis-server --save ""
  2. Run the benchmark script.
| Metric | unstable | new_defrag_algo branch |
| --- | --- | --- |
| Frag at end of test | 1.03 | 1.0 |
| Hits | 1,928,881 | 1,499,838 |
| Misses | 9,385,615 | 2,000,386 |
| Time | 12.69s | 5.12s |

Comment on lines 4025 to 4026:

```c
tsd_t *tsd = tsd_fetch();
arena_t *arena = arena_choose(tsd, NULL);
```
Collaborator
AFAIK, an arena is thread-bound; since defrag only runs in the main thread, could this lead to overlooking internal fragmentation caused by other threads?

Contributor Author

I think you mean external fragmentation, but yes, you are right: defrag will only handle main-thread allocations. But these are the Redis keys and values, which are what we're aiming at. This isn't a change from the existing active-defrag mechanism (@oranagra, am I right here?).

Member

@yoav-steinberg I didn't look at the code of this PR yet, but I suspect you're wrong.
In Redis both the IO threads and modules can currently use allocations from other arenas, and that can actually be one of the causes of high fragmentation.
The original active-defrag mechanism handled these well, since for each allocation it compared the current slab's frag ratio to the arena's average bin frag as an indication of the need to re-allocate.

It's true that the re-allocation itself would later move it to another arena, but I think that would eventually reduce fragmentation.

If the new code only attempts to defrag allocations of the main arena, I think this is something we had better fix.

@yoav-steinberg (Contributor Author) commented May 9, 2022

Some analysis:

I think our key/value allocations are from the main arena even when using I/O threads. This is because the query buffers aren't used for the key/values (even if they are used for argv).

In the case of modules, however, we can't assume this. So you are right, there's a potential bug here.

I was wrong re the old algorithm. I see it only looks at the slab's arena in order to make decisions, not at the main arena. But this means the old algorithm is less efficient in the case where the allocation wasn't from the main arena. In that case we can simply return 1 from get_defrag_hint(), since any fragmentation in the non-main arena will never resolve itself if we realloc to the main arena. So moving everything out of the non-main arena is the only option in this case.

So bottom line:

  • In the old algorithm we should always consider allocations in a non-main arena's non-full slabs as hits.
  • In the new algorithm we should do the same (see the sketch after this list).
  • Ideally we should defrag each arena separately and then do the reallocations inside the same arena. This is possible by running the setup phase on each arena separately (instead of only on the main arena) and then doing the realloc on hits inside the same arena.
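A minimal sketch of the first two points (illustrative names and signature, not the PR's code; treating arena index 0 as the main arena is an assumption):

```c
/* Decide the defrag hint given which arena owns the allocation's slab.
 * Allocations in a non-main arena's non-full slabs always count as hits,
 * since reallocating moves them toward the main arena. */
static int defrag_hint_for_arena(unsigned slab_arena_ind, int slab_is_full,
                                 int main_arena_decision) {
    if (slab_arena_ind != 0) {  /* assumed: arena 0 is the main arena */
        return !slab_is_full;   /* always move out of non-main arenas */
    }
    return main_arena_decision; /* otherwise use the new algorithm */
}
```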

@sundb (Collaborator) commented May 16, 2022

> I think our key/value allocations are from the main arena even when using I/O threads. This is because the query buffers aren't used for the key/values (even if they are used for argv).

We may create `c->argv` in a non-main thread and execute the client command in the main thread; then we may just increment the `refcount` of the robj in `c->argv` to create the key or value, so this memory will have been allocated by a non-main thread.
Please point out if I'm wrong.

Member

Yoav was referring to the fact that key names and values of most types are sds, so they're not the same robj allocation as the argv. And for strings we call tryObjectEncoding, which I might have told him copies the object, but I now realize I may have confused it with what trimStringObjectIfNeeded does (it copies the object only if it has a lot of excess space).
But anyway the final conclusion is the same: since modules may RM_RetainString these argv items and put them in the keyspace, and modules can participate in the defrag mechanism (they have an API for that), either way we need to defrag all arenas.

Contributor Author

@oranagra @sundb I updated the code to handle multiple arenas. I did some local testing and this seems to work fine. Please re-review.

```c
used_slabs += full_slabs;
assert(bin->highest_slab_to_retain_inited == false);
assert(extent_heap_empty(&bin->slabs_nonfull_temp));
if (used_slabs != needed_slabs) {
```
Collaborator

Is a threshold needed here? For example, only defrag when ((used_slabs - needed_slabs) / used_slabs) > threshold (maybe active_defrag_threshold_lower).

Contributor Author

If there aren't many more used slabs than needed slabs then the init phase should be quick. So I'm not sure there's much value in this additional threshold.

Collaborator

Because defrag does not defragment all memory, if some regions are occupied by memory that is not defragmented, it is possible that used_slabs will never equal needed_slabs.

Contributor Author

You are right, but both the original algorithm and this one mitigate this by requiring the user to set relevant/sane active-defrag-threshold-lower and active-defrag-ignore-bytes. I'm not sure we need another threshold here; specifically, this would be a rather difficult one to explain.

@sundb (Collaborator) commented May 18, 2022

@yoav-steinberg Using used_slabs and needed_slabs to calculate whether a bin needs to be defragged is more accurate than the old approach.
But I don't like slabs_nonfull_temp. When we need to defrag a ptr, why not just get the first slab (lowest non-full) from bin->slabs_nonfull, then determine whether it is lower than this ptr's slab? If so, that means we can move ptr to this slab.
This avoids merging slabs_nonfull_temp and slabs_nonfull; I'm not sure if the performance is better.

@yoav-steinberg (Contributor Author)

> When we need to defrag a ptr, why not just get the first slab (lowest non-full) from bin->slabs_nonfull, then determine whether it is lower than this ptr's slab? If so, that means we can move ptr to this slab. This avoids merging slabs_nonfull_temp and slabs_nonfull; I'm not sure if the performance is better.

If we just compare the ptr's slab to the lowest in slabs_nonfull then there are 3 options for where ptr is:

  1. ptr is in a full slab. In that case we don't want to move it, because moving it will increase fragmentation.
  2. ptr is in curslab. In that case moving it will do nothing (it'll move from the current slab to the current slab).
  3. ptr is in the non-full slab heap. In that case it's almost certain that the address of ptr's slab is higher than the first (lowest) address in slabs_nonfull, so we'll always skip moving and won't defrag anything. What we're trying to do is figure out which are the lowest needed_slabs in slabs_nonfull, and make sure to move all other ptrs to one of those lower slabs.

