-
Notifications
You must be signed in to change notification settings - Fork 24.4k
fix a rare active defrag edge case bug leading to stagnation #7289
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
also support: debug mallctl-str thread.tcache.flush VOID
There's a rare case which leads to stagnation in the defragger, causing it to keep scanning the keyspace and do nothing (not moving any allocation), this happens when all the allocator slabs of a certain bin have the same % utilization, but the slab from which new allocations are made have a lower utilization. this commit fixes it by removing the current slab from the overall average utilization of the bin, and also eliminate any precision loss in the utilization calculation and move the decision about the defrag to reside inside jemalloc. and also add a test that consistently reproduce this issue.
Contributor
|
Thank you @oranagra |
JackieXie168
pushed a commit
to JackieXie168/redis
that referenced
this pull request
Jun 18, 2020
fix a rare active defrag edge case bug leading to stagnation
Contributor
Member
Author
|
Yes, I'm aware of that... Apparently I wasn't able to solve the problem. |
oranagra
added a commit
that referenced
this pull request
Nov 21, 2021
Background: Following the upgrade to jemalloc 5.2, there was a test that used to be flaky and started failing consistently (on 32bit), so we disabled it (see #9645). This is a test that i introduced in #7289 when i attempted to solve a rare stagnation problem, and it later turned out i failed to solve it, ans what's more i added a test that caused it to be not so rare, and as i mentioned, now in jemalloc 5.2 it became consistent on 32bit. Stagnation can happen when all the slabs of the bin are equally utilized, so the decision to move an allocation from a relatively empty slab to a relatively full one, will never happen, and in that test all the slabs are at 50% utilization, so the defragger could just keep scanning the keyspace and not move anything. What this PR changes: * First, finally in jemalloc 5.2 we have the count of non-full slabs, so when we compare the utilization of the current slab, we can compare it to the average utilization of the non-full slabs in our bin, instead of the total average of our bin. this takes the full slabs out of the game, since they're not candidates for migration (neither source nor target). * Secondly, We add some 12% (100/8) to the decision to defrag an allocation, this is the part that aims to avoid stagnation, and it's especially important since the above mentioned change can get us closer to stagnation. * Thirdly, since jemalloc 5.2 adds sharded bins, we take into account all shards (something that's missing from the original PR that merged it), this isn't expected to make any difference since anyway there should be just one shard. How this was benchmarked. What i did was run the memefficiency test unit with `--verbose` and compare the defragger hits and misses the tests reported. At first, when i took into consideration only the non-full slabs, it got a lot worse (i got into stagnation, or just got a lot of misses and a lot of hits), but when i added the 10% i got back to results that were slightly better than the ones of the jemalloc 5.1 branch. i.e. full defragmentation was achieved with fewer hits (relocations), and fewer misses (keyspace scans).
hwware
pushed a commit
to hwware/redis
that referenced
this pull request
Dec 20, 2021
Background: Following the upgrade to jemalloc 5.2, there was a test that used to be flaky and started failing consistently (on 32bit), so we disabled it (see redis#9645). This is a test that i introduced in redis#7289 when i attempted to solve a rare stagnation problem, and it later turned out i failed to solve it, ans what's more i added a test that caused it to be not so rare, and as i mentioned, now in jemalloc 5.2 it became consistent on 32bit. Stagnation can happen when all the slabs of the bin are equally utilized, so the decision to move an allocation from a relatively empty slab to a relatively full one, will never happen, and in that test all the slabs are at 50% utilization, so the defragger could just keep scanning the keyspace and not move anything. What this PR changes: * First, finally in jemalloc 5.2 we have the count of non-full slabs, so when we compare the utilization of the current slab, we can compare it to the average utilization of the non-full slabs in our bin, instead of the total average of our bin. this takes the full slabs out of the game, since they're not candidates for migration (neither source nor target). * Secondly, We add some 12% (100/8) to the decision to defrag an allocation, this is the part that aims to avoid stagnation, and it's especially important since the above mentioned change can get us closer to stagnation. * Thirdly, since jemalloc 5.2 adds sharded bins, we take into account all shards (something that's missing from the original PR that merged it), this isn't expected to make any difference since anyway there should be just one shard. How this was benchmarked. What i did was run the memefficiency test unit with `--verbose` and compare the defragger hits and misses the tests reported. At first, when i took into consideration only the non-full slabs, it got a lot worse (i got into stagnation, or just got a lot of misses and a lot of hits), but when i added the 10% i got back to results that were slightly better than the ones of the jemalloc 5.1 branch. i.e. full defragmentation was achieved with fewer hits (relocations), and fewer misses (keyspace scans).
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There's a rare case which leads to stagnation in the defragger, causing
it to keep scanning the keyspace and do nothing (not moving any
allocation), this happens when all the allocator slabs of a certain bin
have the same % utilization, but the slab from which new allocations are
made have a lower utilization.
this commit fixes it by removing the current slab from the overall
average utilization of the bin, and also eliminate any precision loss in
the utilization calculation and move the decision about the defrag to
reside inside jemalloc.
and also add a test that consistently reproduce this issue.