Improved active-defrag #10586
Conversation
… n slabs to retain, then merge back with a single call instead of a loop.
This change also compares slabs to a threshold slab address, instead of flagging every slab to retain, in order to decide which slabs to retain and which to defrag.
Only when both `bin->initing_defrag && bin->highest_slab_to_retain_inited` hold do we need to use the temp heap.
Force-pushed d1e3664 to 04b3897
Simple benchmark:
deps/jemalloc/src/jemalloc.c
Outdated
```c
tsd_t *tsd = tsd_fetch();
arena_t *arena = arena_choose(tsd, NULL);
```
AFAIK, an arena is thread-bound; since defrag only runs in the main thread, could this lead to overlooking internal fragmentation caused by other threads?
I think you mean external fragmentation, but yes, you are right: defrag will only handle main-thread allocations. But these are the Redis keys and values, which are what we're aiming at. This isn't a change from the existing active-defrag mechanism (@oranagra am I right here?).
@yoav-steinberg i didn't look at the code of this PR yet, but i suspect you're wrong.
In Redis both the IO threads and modules can currently use allocations from other arenas, and that can actually be one of the causes of high fragmentation.
The original active-defrag mechanism handled these well, since for each allocation it compared the current slab's frag ratio to the arena's average bin frag as an indication of the need to re-allocate.
It's true that the re-allocation itself would later move it to another arena, but i think that would eventually have reduced fragmentation.
If the new code only attempts to defrag allocations of the main arena, i think this is something we had better fix.
Some analysis:
I think our key/value allocations are from the main arena even when using I/O threads. This is because the query buffers aren't used for the key/values (even if they are used for argv).
In case of modules, however, we can't assume this. So you are right there's a potential bug here.
I was wrong re the old algorithm. I see it only looks at the slab's arena in order to make decisions, not at the main arena. But this means the old algorithm is less efficient when the allocation wasn't from the main arena. In that case we can simply return 1 from get_defrag_hint(), since any fragmentation in the non-main arena will never resolve itself if we realloc to the main arena. So moving everything out of the non-main arena is the only option in this case.
So bottom line:
- In the old algorithm we should always consider allocations in non-main arenas' non-full slabs as hits.
- In the new algorithm we should do the same.
- Ideally we should defrag each arena separately and then do the reallocations inside the same arena. This is possible by running the setup phase on each arena separately (instead of only on the main arena) and then doing the realloc on hits inside the same arena.
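The first two bullets could look roughly like this. This is a minimal sketch of the proposed rule, not jemalloc's actual `get_defrag_hint()`: the struct, field names, and `MAIN_ARENA_IND` below are illustrative assumptions.

```c
#include <stdbool.h>

/* Sketch: any allocation in a non-full slab of a non-main arena is always a
 * defrag "hit", because fragmentation there can never resolve itself once we
 * reallocate into the main arena. All names here are illustrative. */

typedef struct {
    unsigned arena_ind;      /* index of the arena owning the allocation's slab */
    unsigned slab_nregs;     /* regions currently used in the slab */
    unsigned slab_nregs_max; /* region capacity of the slab */
} alloc_info_t;

#define MAIN_ARENA_IND 0u

/* Returns 1 when the allocation should be moved (reallocated). */
static int get_defrag_hint_sketch(const alloc_info_t *info,
                                  unsigned avg_bin_nregs) {
    if (info->arena_ind != MAIN_ARENA_IND) {
        /* Non-full slab in a non-main arena: always a hit. */
        return info->slab_nregs < info->slab_nregs_max;
    }
    /* Main arena: fall back to the fullness heuristic. */
    return info->slab_nregs < avg_bin_nregs;
}
```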
> I think our key/value allocations are from the main arena even when using I/O threads. This is because the query buffers aren't used for the key/values (even if they are used for argv).
We may create `c->argv` in a non-main thread and execute the client command in the main thread; then we may just increment the `refcount` of a robj in `c->argv` to create a key or value, so this memory will have been allocated by a non-main thread.
Please point out if I'm wrong.
Yoav was referring to the fact that key names and values of most types are sds, so they're not the same robj allocation from the argv. And for strings we call tryObjectEncoding, which i might have told him copies the object, but i now realize i might have confused it with what trimStringObjectIfNeeded does (it copies the object only if it has a lot of excess space).
But anyway the final conclusion is the same: since modules may RM_RetainString these argv items and put them in the key-space, and modules can participate in the defrag mechanism (they have an API for that), either way we need to defrag all arenas.
deps/jemalloc/src/jemalloc.c
Outdated
```c
used_slabs += full_slabs;
assert(bin->highest_slab_to_retain_inited == false);
assert(extent_heap_empty(&bin->slabs_nonfull_temp));
if (used_slabs != needed_slabs) {
```
Is a threshold needed here? For example, only proceed when `((used_slabs - needed_slabs) / used_slabs) > threshold` (maybe `active_defrag_threshold_lower`).
If there aren't many more used slabs than needed slabs then the init phase should be quick. So I'm not sure there's much value in this additional threshold.
Because defrag does not defragment all memory, if some regions are occupied by memory that is not defragmented, it is possible that used_slabs will never equal needed_slabs.
You are right, but both the original algorithm and this one mitigate this by requiring the user to set relevant/sane `active-defrag-threshold-lower` and `active-defrag-ignore-bytes`. I'm not sure we need another threshold here; specifically, it would be a rather difficult one to explain.
@yoav-steinberg Use …

If we just compare the ptr's slab to the lowest in …
This PR tries to fix the heuristic nature of the current active defrag algorithm.
note: this was a joint effort with @inbaryuval.
The current algorithm
We check each allocation's slab's fullness and compare that to the average fullness of all slabs in the bin. If the slab is fuller than the average slab in the bin we keep the allocation; if not, we reallocate it.
The logic behind this approach is that [re]allocations always go to the lowest slabs, making it statistically probable that lower slabs will be fuller than higher ones.
The problem is this isn't always the case. The algorithm may reallocate a value that resides in a slab that isn't going to be freed, or may skip reallocating some values in the initial scan, requiring additional keyspace scans to complete the defragmentation process.
In practice we added safety thresholds when comparing slab fullness and added extra tests for edge cases that caused the algorithm never to complete without these safety factors. See (#7289, #9778).
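The current heuristic can be summarized by a small sketch. The types and names here are illustrative stand-ins for this description, not the actual jemalloc code:

```c
/* Sketch of the fullness heuristic described above: reallocate an allocation
 * only if its slab is less full than the average slab in its bin.
 * All names are illustrative. */

typedef struct {
    unsigned long bin_nregs_used; /* regions used across all slabs in the bin */
    unsigned long bin_nslabs;     /* number of slabs in the bin */
} bin_stats_t;

/* Returns 1 if the allocation should be moved. */
static int should_realloc_heuristic(unsigned slab_nregs_used,
                                    const bin_stats_t *bin) {
    /* average regions used per slab in this bin */
    unsigned long avg = bin->bin_nregs_used / bin->bin_nslabs;
    return (unsigned long)slab_nregs_used < avg;
}
```

The heuristic nature is visible here: the average says nothing definitive about whether this particular slab will ever be fully freed.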
The proposed algorithm
To avoid the problems above we can decide non-heuristically whether an allocation needs to be moved out of its slab. jemalloc always allocates from the lowest slab in the bin with free space in it (in an attempt to minimize fragmentation in the first place). This means that, given N slabs in the bin, at the end of the defragmentation process the lower M slabs (M<N) will be full while the higher N-M slabs will be freed. Given this knowledge, all we need to do per allocation is figure out whether that allocation's slab is in the lower M slabs or not. If it is, we don't need to reallocate it; if it isn't, we reallocate it and it will by definition move to somewhere in the lower M slabs.
Details
The algorithm needs to find, for each bin, the slab M which is the highest slab to be retained; all allocations in slabs with a higher address than M will need to be reallocated. The first phase of the algorithm calculates how many slabs are actually required for all the allocations in the bin: `defrag_slabs_to_retain`.
Setup phase
Once we know the value of `defrag_slabs_to_retain` we need to find the address of the highest slab in the group of the lowest `defrag_slabs_to_retain` slabs. We do this by popping the lowest `defrag_slabs_to_retain` slabs out of the bin's `slabs_nonfull` heap, and we store the address of the last slab to be popped in `highest_slab_to_retain`. All popped items are stored in a temporary heap, `slabs_nonfull_temp`, and once we know `highest_slab_to_retain` we merge this temporary heap back into the bin's normal `slabs_nonfull` heap. Note that the complexity of storing all the popped items into the temp heap is O(N*log(N)), where N is the number of popped slabs. This means that the setup phase may block for a long while if there are many non-full slabs.
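A minimal sketch of this setup phase, with a linear-scan array standing in for jemalloc's pairing heap. The fixed capacity, names, and helper functions are illustrative assumptions, not the real data structures:

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_SLABS 64 /* illustrative fixed capacity for the sketch */

/* A toy "heap" of slab addresses; pop always returns the lowest address,
 * mimicking a min-heap keyed on slab address. */
typedef struct {
    uintptr_t addr[MAX_SLABS];
    size_t n;
} slab_heap_t;

static uintptr_t heap_pop_lowest(slab_heap_t *h) {
    size_t min_i = 0;
    for (size_t i = 1; i < h->n; i++)
        if (h->addr[i] < h->addr[min_i]) min_i = i;
    uintptr_t lowest = h->addr[min_i];
    h->addr[min_i] = h->addr[--h->n]; /* swap-remove */
    return lowest;
}

static void heap_insert(slab_heap_t *h, uintptr_t a) { h->addr[h->n++] = a; }

/* Pop the lowest `nretain` slabs into a temp heap, remember the highest
 * popped address, then merge the temp heap back into the nonfull heap.
 * Returns highest_slab_to_retain. */
static uintptr_t setup_phase(slab_heap_t *nonfull, size_t nretain) {
    slab_heap_t temp = { .n = 0 };
    uintptr_t highest = 0;
    for (size_t i = 0; i < nretain && nonfull->n > 0; i++) {
        highest = heap_pop_lowest(nonfull);
        heap_insert(&temp, highest);
    }
    /* merge temp back into the bin's normal nonfull heap */
    while (temp.n > 0)
        heap_insert(nonfull, heap_pop_lowest(&temp));
    return highest;
}
```

The O(N*log(N)) cost in the real implementation comes from the heap inserts into the temp heap; this sketch only preserves the pop/record/merge structure.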
Incremental setup
To handle the possible blocking during the setup phase, we perform it incrementally. After calculating `defrag_slabs_to_retain` for each bin, we run the setup phase on each bin with a timeout; if the timeout expires we stop. At this point `slabs_nonfull` is split into `slabs_nonfull` and `slabs_nonfull_temp`, and any attempt to perform operations (get_first, remove, insert) on the "nonfull" heap of this bin will now access the relevant heap of the two. The defrag cron will resume the setup phase in its next cycle. Once the bin's setup phase is done we mark it as such and merge the two heaps back into a single `slabs_nonfull` heap.
Defrag phase
During this phase, instead of performing the heuristic described above, we just compare the slab of the allocation in question to `highest_slab_to_retain`: if its address is above that slab's address we reallocate; if not, we keep the allocation in place.
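The defrag-phase check then reduces to a single address comparison. In this sketch, the fixed `SLAB_SIZE` and the slab-size-aligned-slab assumption are illustrative (real jemalloc slab sizes vary per bin and the slab is looked up via extent metadata):

```c
#include <stdint.h>

#define SLAB_SIZE 4096u /* illustrative; real slab sizes vary per bin */

/* Derive the slab base address from an allocation pointer, assuming
 * slab-size-aligned slabs (an assumption made for this sketch only). */
static uintptr_t slab_of(uintptr_t ptr) {
    return ptr & ~((uintptr_t)SLAB_SIZE - 1);
}

/* An allocation is a hit (should be reallocated) iff its slab lies
 * strictly above highest_slab_to_retain. */
static int defrag_hit(uintptr_t ptr, uintptr_t highest_slab_to_retain) {
    return slab_of(ptr) > highest_slab_to_retain;
}
```

Unlike the old fullness heuristic, this check is deterministic: every hit moves, by construction, into the retained lower slabs, so one pass over the keyspace can complete the defragmentation.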