Refactor of ActiveDefrag to reduce latencies #1242
madolson merged 2 commits into valkey-io:unstable
Conversation
Codecov Report

@@            Coverage Diff             @@
##           unstable    #1242     +/-  ##
============================================
+ Coverage     70.62%   70.78%    +0.15%
============================================
  Files           117      118        +1
  Lines         63313    63430      +117
============================================
+ Hits          44716    44896      +180
+ Misses        18597    18534       -63
Force-pushed from ffdf32a to 4054b1f (Compare)
@zvi-code Would you mind also helping take a look at this PR?
@JimB123, I skimmed through the code; at a very high level, it looks good (and the results you posted are very promising!). I will follow up with a more detailed review of the code.
@zvi-code The Valkey event loop supports creation of various timer events. For defrag, there are 2 things that need to happen. First, we need to decide whether we should begin to defrag at all. It makes sense to make that decision from the main serverCron timer. In the old code, if we targeted 10% of the CPU, that was done by letting defrag run for (a continuous) 10ms every time the 100ms serverCron timer fired. In the new code, with the same 10% CPU target, we run the defrag job for only 500us, but schedule it on its own dedicated timer so that it can run more often. The code modulates the frequency rather than the duration. (In extreme cases, anti-starvation must still modulate the duration, since starvation impacts the frequency.)
I haven't looked at the lazy expiration before, but looking at it now, that's exactly what I'm saying. We shouldn't be doing ANYTHING in
OK, wow, this is another thing I haven't looked at before.
madolson left a comment:
Mostly looks good, much more readable than before! Some minor comments.
Force-pushed from 3f7d0ad to 825da4f (Compare)
@valkey-io/core-team there is a small but significant change here in the form of a config that determines how much time we spend on active defrag. Please take a look and provide 👍 or 👎. You can see the definition here:

Other than that, this change just better amortizes the active-defrag work to reduce latency spikes.
@JimB123, added several comments; will continue and complete the review tomorrow.
Force-pushed from 825da4f to a943d5e (Compare)
madolson left a comment:
Just cleaning up merge conflicts.
madolson left a comment:
A few more merge conflict resolutions.
madolson left a comment:
I think this looks good; just waiting for Zvi's review and TSC consensus. I pinged them on Slack as well to take a look here.
This config is similar to how we control active rehashing and active expire. I just want to put it into context to confirm this is the config we want. Binbin recently adjusted active rehashing to do work on a similar time-budget basis, and for active defrag we already have related configs.
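For context, these are the pre-existing active-defrag tunables inherited from the redis.conf lineage that the new config sits alongside (values shown are illustrative defaults, not a recommendation from this PR):

```
activedefrag no
active-defrag-ignore-bytes 100mb
active-defrag-threshold-lower 10
active-defrag-threshold-upper 100
active-defrag-cycle-min 1
active-defrag-cycle-max 25
```

The `cycle-min`/`cycle-max` pair already expresses defrag effort as a CPU percentage range, which is the precedent the new config is being compared against.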
zvi-code left a comment:
I still have some questions about the risk that this change will actually increase tail latencies in some scenarios. I added specific comments/questions in the relevant code.
That explains it. Thanks!
zuiderkwast left a comment:
Thanks for this feature. I only reviewed the configs.
Force-pushed from 0ed8c4b to 6cf4898 (Compare)
At this point, I have rebased and either addressed or responded to all comments.
Force-pushed from 6cf4898 to d110d12 (Compare)
Signed-off-by: Jim Brunner <brunnerj@amazon.com>
Force-pushed from d110d12 to 4d2777b (Compare)
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
zuiderkwast left a comment:
Not a full review. I trust the other reviewers.
The recent PR (#1242) converted Active Defrag to use `monotime`. In that change, a conversion was performed to continue to use `ustime()` as part of the module interface. Since this time is only used internally, and never actually exposed to the module, we can convert this to use `monotime` directly. Signed-off-by: Jim Brunner <brunnerj@amazon.com>
This update addresses several issues in defrag:

1. In the defrag redesign (#1242), a bug was introduced where `server.cronloops` was no longer being incremented in the `whileBlockedCron()`. This resulted in some memory statistics not being updated while blocked.
2. In the test case for AOF loading, we were seeing errors due to defrag latencies. However, running the math, the latencies are justified given the extremely high CPU target of the testcase. Adjusted the expected latency check to allow longer latencies for this case where defrag is undergoing starvation while AOF loading is in progress.
3. A "stage" is passed a "target". For the main dictionary and expires, we were passing in a `kvstore*`. However, on flushall or swapdb, the pointer may change. It's safer and more stable to use an index for the DB (a DBID). Then if the pointer changes, we can detect the change, and simply abort the stage. (If there's still fragmentation to deal with, we'll pick it up again on the next cycle.)
4. We always start a new stage on a new defrag cycle. This gives the new stage time to run, and prevents latency issues for certain stages which don't operate incrementally. However, often several stages will require almost no work, and this will leave a chunk of our CPU allotment unused. This is mainly an issue in starvation situations (like AOF loading or a LUA script) where defrag is running infrequently, with a large duty-cycle. This change allows a new stage to be initiated if we still have a standard duty-cycle remaining. (This can happen during starvation situations where the planned duty cycle is larger than the standard cycle. Most likely this isn't a concern for real scenarios, but it was observed in testing.)
5. Minor comment correction in `server.h`

Signed-off-by: Jim Brunner <brunnerj@amazon.com>
Refer to: valkey-io#1141

This update refactors the defrag code to:
* Make the overall code more readable and maintainable
* Reduce latencies incurred during defrag processing

With this update, the defrag cycle time is reduced to 500us, with more frequent cycles. This results in much more predictable latencies, with a dramatic reduction in tail latencies. (See valkey-io#1141 for more complete details.)

This update is focused mostly on the high-level processing, and does NOT address lower level functions which aren't currently timebound (e.g. `activeDefragSdsDict()` and `moduleDefragGlobals()`). These are out of scope for this update and left for a future update.

I fixed `kvstoreDictLUTDefrag` because it was using up to 7ms on a CME single shard.

See original github issue for performance details.

---------

Signed-off-by: Jim Brunner <brunnerj@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Refer to: #1141
This update refactors the defrag code to:
* Make the overall code more readable and maintainable
* Reduce latencies incurred during defrag processing
With this update, the defrag cycle time is reduced to 500us, with more frequent cycles. This results in much more predictable latencies, with a dramatic reduction in tail latencies.
(See #1141 for more complete details.)
This update is focused mostly on the high-level processing, and does NOT address lower level functions which aren't currently timebound (e.g. `activeDefragSdsDict()` and `moduleDefragGlobals()`). These are out of scope for this update and left for a future update.

I fixed `kvstoreDictLUTDefrag` because it was using up to 7ms on a CME single shard.

During unit tests, the following max latencies were measured (in verbose mode):
In addition, the following test was run on both old and new versions of the software:
Configuration was set so that both old/new clusters would use 10% defrag CPU.
Defrag time OLD: 120 sec
Defrag time NEW: 105 sec
Times were based on a 5-second poll. (I didn't have logs running.)
The improvement in run time is believed to be due to the unskewed scheduling of the new code, which delivers a more accurate 10% of the CPU.
This is the OLD distribution for the GET benchmark while defragging:
This is the equivalent NEW distribution:
You can see in the new distribution that there is a very slight increase in latencies <= 99.951%, in exchange for a huge reduction in tail latencies.
A CONTROL test WITHOUT Active Defrag running shows a very similar distribution, with a variation of roughly 500us in the tail latencies (as expected).