
Refactor of ActiveDefrag to reduce latencies #13814

Merged
sundb merged 21 commits into redis:unstable from sundb:active-defrag
Feb 19, 2025

Conversation


@sundb sundb commented Feb 19, 2025

This PR is based on: valkey-io/valkey#1242

Issue/Problems

Duty Cycle: Active Defrag has configuration values which determine the intended percentage of CPU to be used based on a gradient of the fragmentation percentage. However, Active Defrag performs its work on the 100ms serverCron timer. It then computes a duty cycle and performs a single long cycle. For example, if the intended CPU is computed to be 10%, Active Defrag will perform 10ms of work on this 100ms timer cron.

  • This type of cycle introduces large latencies on the client (up to 25ms with default configurations)
  • This mechanism is subject to starvation when slow commands delay the serverCron

Maintainability: The current Active Defrag code is difficult to read & maintain. Refactoring of the high level control mechanisms and functions will allow us to more seamlessly adapt to new defragmentation needs. Specific examples include:

  • A single function (activeDefragCycle) includes the logic to start/stop/modify the defragmentation as well as performing one "step" of the defragmentation. This should be separated out, so that the actual defrag activity can be performed on an independent timer (see duty cycle above).
  • The code is focused on kvstores, with other actions just thrown in at the end (defragOtherGlobals). There's no mechanism to break this up to reduce latencies.
  • For the main dictionary (only), there is a mechanism to set aside large keys to be processed in a later step. However, this code creates a separate list in each kvstore (main dict or not), leaking internal defrag logic. We only need one list, inside defrag; this logic should be more contained for the main key store.
  • The structure is not well suited to other non-main-dictionary items. For example, pub-sub and pub-sub-shard were added, but in such a way that in CMD mode, with multiple DBs, we will defrag pub-sub repeatedly after each DB.

Description of the feature

Primarily, this feature will split activeDefragCycle into 2 functions.

  1. One function will be called from serverCron to determine if a defrag cycle (a complete scan) needs to be started. It will also determine if the CPU expenditure needs to be adjusted.
  2. The second function will be a timer proc dedicated to performing defrag. This will be invoked independently from serverCron.

Once the functions are split, there is more control over the latency created by the defrag process. A new constant (DEFRAG_CYCLE_US) will be used to determine the running time for the defrag timer proc. The value for this will be 500us (one-half of the current minimum time). Then the timer will be adjusted to achieve the desired CPU. As an example, 5% of CPU will run the defrag process for 500us every 10ms. This is much better than running for 5ms every 100ms.

The timer function will also adjust to compensate for starvation. If a slow command delays the timer, the process will run proportionately longer to ensure that the configured CPU is achieved. Given the presence of slow commands, the proportional extra time is insignificant to latency. This also addresses the overload case. At 100% CPU, if the event loop slows, defrag will run proportionately longer to achieve the configured CPU utilization.

Optionally, in low CPU situations, there would be little impact in utilizing more than the configured CPU. We could optionally allow the timer to pop more often (even with a 0ms delay) and the (tail) latency impact would not change.

Addressing maintainability:

  • The basic code structure can more clearly be organized around a "cycle".
  • Have clear begin/end functions and a set of "stages" to be executed.
  • Rather than stages being limited to "kvstore" type data, a cycle should be more flexible, incorporating the ability to incrementally perform arbitrary work. This will likely be necessary in the future for certain module types. It can be used today to address oddballs like defragOtherGlobals.
  • Reduce some of the globals and some of the coupling; defrag_later should be removed from serverDb.
  • Each stage should begin on a fresh cycle. So if there are non-time-bounded operations like kvstoreDictLUTDefrag, these would be less likely to introduce additional latency.

Different from Valkey

  1. We eliminate the active-defrag-cycle-us configuration and make it a constant, because this setting is hard for users to tune: too large a value can result in high latency, while too small a value makes defragmentation very inefficient.
  2. We add a time limit to the defrag duty cycle to prevent excessive latency. When latency is already high (indicated by a long time between calls), we don't want to make it worse by running defrag for too long.
  3. The code structure is optimized to avoid excessive global variables, so that more defragmentation can be introduced in the future without adding more globals.

Signed-off-by: Jim Brunner brunnerj@amazon.com
Signed-off-by: Madelyn Olson madelyneolson@gmail.com
Co-authored-by: Madelyn Olson madelyneolson@gmail.com

sundb added the release-notes label (indication that this issue needs to be mentioned in the release notes) Feb 19, 2025

sundb commented Feb 19, 2025

Benchmark

// Create fragmented host
./src/redis-benchmark -r 10000000 -n 10000000 -d 3 -t set -P 100
./src/redis-benchmark -r 9000000 -n 10000000 -d 11 -t set  -P 100
./src/redis-benchmark -r 8000000 -n 10000000 -d 19 -t set  -P 100
./src/redis-benchmark -r 7000000 -n 10000000 -d 27 -t set -P 100

// Enable defrag while running some traffic
./src/redis-cli config set activedefrag yes; ./src/redis-benchmark -r 7000000 -c 1 -n 1000000 -l -t get

Latency by percentile distribution
Before this PR:

Latency by percentile distribution:
0.000% <= 0.015 milliseconds (cumulative count 585426)
75.000% <= 0.023 milliseconds (cumulative count 996080)
99.609% <= 0.031 milliseconds (cumulative count 997317)
99.805% <= 0.047 milliseconds (cumulative count 998207)
99.902% <= 0.063 milliseconds (cumulative count 999251)
99.951% <= 0.087 milliseconds (cumulative count 999526)
99.976% <= 0.127 milliseconds (cumulative count 999769)
99.988% <= 6.095 milliseconds (cumulative count 999913)
99.994% <= 6.103 milliseconds (cumulative count 999966)
99.997% <= 6.111 milliseconds (cumulative count 999973)
99.998% <= 6.151 milliseconds (cumulative count 999985)
99.999% <= 6.159 milliseconds (cumulative count 999994)
100.000% <= 6.183 milliseconds (cumulative count 999997)
100.000% <= 6.247 milliseconds (cumulative count 999999)
100.000% <= 6.263 milliseconds (cumulative count 1000000)

this PR:

Latency by percentile distribution:
0.000% <= 0.015 milliseconds (cumulative count 935481)
93.750% <= 0.023 milliseconds (cumulative count 995146)
99.609% <= 0.047 milliseconds (cumulative count 996269)
99.805% <= 0.455 milliseconds (cumulative count 998460)
99.902% <= 0.463 milliseconds (cumulative count 999687)
99.976% <= 0.471 milliseconds (cumulative count 999949)
99.997% <= 0.487 milliseconds (cumulative count 999973)
99.998% <= 0.583 milliseconds (cumulative count 999987)
99.999% <= 0.591 milliseconds (cumulative count 999996)
100.000% <= 0.599 milliseconds (cumulative count 999998)
100.000% <= 0.607 milliseconds (cumulative count 999999)
100.000% <= 1.919 milliseconds (cumulative count 1000000)

Co-authored-by: ShooterIT <wangyuancode@163.com>

@ShooterIT ShooterIT left a comment


LGTM, reviewed internally

@sundb sundb merged commit 725cd26 into redis:unstable Feb 19, 2025
sundb added a commit that referenced this pull request Feb 23, 2025
1) Fix a bug that passed an incorrect endtime to modules.
   This bug was found by @ShooterIT.
   After #13814, all endtimes are monotonic time, and we should no
   longer convert them to relative ustime.
   Add assertions to prevent endtime from being much larger than the
   current time.

2) Fix a race in test `Reduce defrag CPU usage when module data can't be
defragged`

---------

Co-authored-by: ShooterIT <wangyuancode@163.com>
fcostaoliveira pushed a commit to filipecosta90/redis that referenced this pull request Feb 24, 2025
sundb added a commit that referenced this pull request May 18, 2025
In PR #13229, we introduced the ebucket for HFE.
Before this PR, when updating eitems stored in ebuckets, the lack of
incremental defragmentation support for non-kvstore data structures
(until PR #13814) meant that we had to reverse-lookup the position of
the eitem in the ebucket and then perform the update.
This approach was inefficient, as it often required frequent traversals
of the segment list to locate and update the item.

To address this issue, this PR implements incremental defragmentation
for hash dict ebuckets and server.hexpires.
By incrementally defragging the ebuckets, we also defragment the
associated items, eliminating the need for frequent traversals of the
segment list when defragging an eitem.

---------

Co-authored-by: Moti Cohen <moticless@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
sundb added a commit that referenced this pull request May 27, 2025
… timer to run twice as fast (#14081)

This bug was introduced in
[#13814](#13814), and was found by
@guybe7.
It incorrectly moved the update of `server.cronloops` from
`whileBlockedCron()` to `activeDefragTimeProc()`,
causing the cron-based timers to effectively run twice as fast when
active defrag is enabled.
As a result, memory statistics are not updated during blocked
operations.
The fix is split out from #13995; since it needs to be backported, it is done in a separate PR.
YaacovHazan pushed a commit to YaacovHazan/redis that referenced this pull request May 27, 2025
YaacovHazan pushed a commit that referenced this pull request May 27, 2025
sundb added a commit that referenced this pull request Jun 5, 2025
… expires stage (#14092)

This bug was introduced by #13814

When defragmenting `db->expires`, if the process exits early and
`db->expires` was modified in the meantime (e.g., FLUSHDB), we need to
check whether the previously defragmented expires is still the same as
the current one when resuming. If they differ, we should abort the
current defragmentation of expires.

However, in #13814, I made a mistake by using `db->keys` rather than
`db->expires`, so expires would never be defragged.